diff --git "a/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" "b/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" index 3b92d8a340285b70b9302b9b879d454c1f508bf5..158d9f340b50c8a51feb6f4de8749669bcf6fe58 100644 --- "a/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" +++ "b/thirdparty/skynoon/user_cases/AI-BOX-MTSDK-1.3.0/AIBOX-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" @@ -161,34 +161,7 @@ True # 常见问题 -## Q1:libmpi_cxx.so.40: cannot open shared object file: No such file or directory - -```shell -(vllm_musa) dev@localhost:~/model$ -(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" -Traceback (most recent call last): - File "", line 1, in - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 236, in - _load_global_deps() - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 195, in _load_global_deps - raise err - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 176, in _load_global_deps - ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL) - File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/ctypes/__init__.py", line 374, in __init__ - self._handle = _dlopen(self._name, mode) -OSError: libmpi_cxx.so.40: cannot open shared object file: No such file or directory -``` - -**解决方案** - -证明mpi so缺失,请下载以下依赖: - -```shell -sudo apt update - -``` - -## Q2:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x +## Q1:NumPy 2.2.6 as it may crash. 
To support both 1.x and 2.x ```python A module that was compiled using NumPy 1.x cannot be run in @@ -224,7 +197,7 @@ conda activate vllm_musa pip3 install numpy==1.26.4 ``` -## Q3:ImportError: Please try running Python from a different directory! +## Q2:ImportError: Please try running Python from a different directory! ```python (vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" @@ -258,13 +231,13 @@ pip3 uninstall torch torch_musa torchaudio torchvision **注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次** -## Q4:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory +## Q3:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory **解决方案** 请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次 -## Q5: MUSA driver initialization failed +## Q4: MUSA driver initialization failed ```shell Traceback (most recent call last): @@ -330,8 +303,8 @@ conda activate vllm_musa cd ${packagedirectory} pip3 install -r requirements.txt -pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl -pip3 install vllm-0.9.2.dev257+g4747b491f-cp310-cp310-linux_aarch64.whl +pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl --force-reinstall +pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl ``` @@ -354,13 +327,42 @@ Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 对于量化模型,我们提供以下加速版模型: -| 模型名称 | 魔塔地址 | git克隆地址 | -| -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git | -| gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | 
https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git | -| gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git | -| gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git | -| Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git | +| 类别 | 模型名称 | 魔塔地址 | git克隆地址 | +| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| DeepSeek | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git | +| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git | +| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git | +| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git | +| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git | +| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git | +| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git | +| | gptq-Qwen2.5-7B-Instruct-v2 | 
https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | 
https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
 
 **PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换工具后期会开放
 
diff --git "a/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" 
"b/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" new file mode 100644 index 0000000000000000000000000000000000000000..22915d011e17dce9b626185bdd7c06c697c7d87f --- /dev/null +++ "b/thirdparty/skynoon/user_cases/AI-Book-MTSDK-1.3.0/AIBOOK-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" @@ -0,0 +1,877 @@ +

+logo
+
+摩尔线程-AIBOOK
+
+基于vllm-musa安装模型指引
+
+lastAuthor:Roger.ye
+
+lastDate:2026-03-11
+
+ + + + + + + + +# 准备工作 + +## 工作目录 + +为了方便后续操作,我们先约定一个工作目录。在接下来的指引中,我们将用 **`${WorkDir}`** 表示这个目录的路径。请您在开始前先**创建好您的工作目录**,并确保后续所有操作都在此目录下进行。 + +## 系统环境要求 + +| 组件 | 版本 | +| -------------- | ----------- | +| MUSA Driver | 3.1.3-AB100 | +| MUSA SDK | 4.1.2-rc2 | +| AIOS(操作系统) | 1.3.3 | + +## 查看musa环境 + +```shell +# 查看驱动 +dpkg -l|grep musa + +# 输出 +ii musa 1.3.3-AB100 arm64 Moore Threads MUSA driver [7f92281de] +ii musa-sdk 4.1.2-rc2 arm64 Moore Threads MTGPU Software Development Kit + +# 查看环境 +ll /usr/local/ |grep musa* + +# 输出 +lrwxrwxrwx 1 mt mt 10 12月 31 14:20 musa -> musa-4.1.2/ +drwxr-xr-x 13 mt mt 4096 3月 10 12:30 musa-4.1.2/ +``` + +或 + +```shell +# 执行 +musaInfo + +# 输出正常则环境配置正确 +compiler: mcc +-------------------------------------------------------------------------------- +device# 0 +Name: M1000 +pciBusID: 0x0 +pciDeviceID: 0x0 +pciDomainID: 0x0 +multiProcessorCount: 8 +maxThreadsPerMultiProcessor: 6144 +isMultiGpuBoard: 1 +clockRate: 0 Mhz +memoryClockRate: 0 Mhz +memoryBusWidth: 384 +totalGlobalMem: 31.05 GB +sharedMemPerMultiprocessor: 72.00 KB +totalConstMem: 8192 +sharedMemPerBlock: 72.00 KB +canMapHostMemory: 1 +regsPerBlock: 262144 +warpSize: 128 +l2CacheSize: 0 +computeMode: 0 +maxThreadsPerBlock: 1024 +maxThreadsDim.x: 1024 +maxThreadsDim.y: 1024 +maxThreadsDim.z: 1024 +maxGridSize.x: 2147483647 +maxGridSize.y: 2147483647 +maxGridSize.z: 2147483647 +major: 2 +minor: 2 +concurrentKernels: 1 +cooperativeLaunch: 0 +cooperativeMultiDeviceLaunch: 0 +isIntegrated: 1 +maxTexture1D: 32768 +maxTexture2D.width: 32768 +maxTexture2D.height: 32768 +maxTexture3D.width: 16384 +maxTexture3D.height: 16384 +maxTexture3D.depth: 2048 +peers: +non-peers: device#0 + +memInfo.total: 31.05 GB +memInfo.free: 19.02 GB (61%) + +``` + +## 升级系统(若不满足) + +AIBOOK系统升级参考以下文档 + +``` +https://ab-web.aibook.net.cn/WEB/pdf/爱簿智能AIBOOK%20OTA升级指导手册.pdf +``` + +**注:**系统升级包含了musa的版本更新,升级完成后无需再另行安装 + +# 安装miniconda + +```shell +curl -O 
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh +chmod +x Miniconda3-latest-Linux-aarch64.sh && ./Miniconda3-latest-Linux-aarch64.sh +# 一路回车到底 +``` + +# 配置conda环境变量 + +```shell +export PATH=/home/kevin/miniconda3/bin:$PATH # 配置时/home/kevin/miniconda3/bin该路径请根据自身情况填写 +source ~/.bashrc +conda init +source ~/.bashrc +``` + +# 创建conda环境 + +```shell +conda create -n vllm_musa python=3.10 +``` + +# 激活conda环境 + +```shell +conda activate vllm_musa +``` + +# 配置PIP国内镜像源 + +```shell +pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ +``` + +# 安装依赖 + +```shell +sudo apt update + +sudo apt install -y python3-pip git cmake wget build-essential g++ libstdc++-12-dev libnuma-dev openmpi-bin openmpi-common libopenblas-dev curl +``` + +# 下载安装包合集 + +```shell +# 进入工作目录 +cd ${WorkDir} + +# 下载 +wget https://mt-ai-data.tos-cn-shanghai.volces.com/vllm_musa/v1.3.1/release_1.3.3/20260302/AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1.tar.gz + +# 解压 +tar zxvf AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1.tar.gz +cd AIBook-release_1.3.3-vllm_musa_1.3.1-torch_2.1.1 +``` + +# 安装 torch_musa依赖 + +```shell +# 进入依赖包目录 +cd ${packagedirectory} + +# 确认安装环境 +conda activate vllm_musa + +pip3 install torch-2.5.0-cp310-cp310-linux_aarch64.whl +pip3 install torch_musa-2.1.1-cp310-cp310-linux_aarch64.whl +pip3 install torchaudio-2.5.0a0+56bc006-cp310-cp310-linux_aarch64.whl +pip3 install torchvision-0.20.0a0+afc54f7-cp310-cp310-linux_aarch64.whl + +# 此处需要保证numpy处于低版本 +pip3 install numpy==1.26.4 +``` + +# 验证torch_musa + +输出 true 证明torch_musa环境安装正确 + +```shell +# 命令 +python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" + +# 期望结果 +Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 +True +``` + +# 常见问题 + +## Q1:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x + +```python +A module that was compiled using NumPy 1.x cannot be run in +NumPy 2.2.6 as it may crash. 
To support both 1.x and 2.x +versions of NumPy, modules must be compiled with NumPy 2.0. +Some module may need to rebuild instead e.g. with 'pybind11>=2.12'. + +If you are a user of the module, the easiest solution will be to +downgrade to 'numpy<2' or try to upgrade the affected module. +We expect that some modules will need time to support NumPy 2. + +Traceback (most recent call last): File "", line 1, in + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 1471, in + from .functional import * # noqa: F403 + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/functional.py", line 9, in + import torch.nn.functional as F + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in + from .modules import * # noqa: F403 + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in + from .transformer import TransformerEncoder, TransformerDecoder, \ + File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in + device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), +/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.) + device: torch.device = torch.device(torch._C._get_default_device()), # torch.device('cpu'), +``` + +**解决方案** + +```shell +# 确保正确python环境 +conda activate vllm_musa + +pip3 install numpy==1.26.4 +``` + +## Q2:ImportError: Please try running Python from a different directory! 
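出现这类导入错误时,可以先确认 torch 相关包实际安装的版本是否与下载的 whl 包一致(以下为排查用的示意脚本,仅使用 Python 标准库 importlib.metadata,并非官方工具):

```python
# 示意脚本:列出 torch 相关包当前安装的版本,便于与 whl 包名中的版本核对
from importlib import metadata

versions = {}
for pkg in ("torch", "torch_musa", "torchaudio", "torchvision"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = None  # 该包未安装
    print(pkg, versions[pkg] or "未安装")
```

若输出的版本与安装时使用的 whl 包版本不一致,可按下述方案卸载后重装。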
+
+```python
+(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())"
+Traceback (most recent call last):
+  File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/__init__.py", line 41, in <module>
+    import torch_musa._MUSAC
+ImportError: /home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/lib/libmusa_python.so.2: undefined symbol: _ZN5torch12py_symbolizeERSt6vectorIPNS_17CapturedTracebackESaIS2_EE
+
+The above exception was the direct cause of the following exception:
+
+Traceback (most recent call last):
+  File "<string>", line 1, in <module>
+  File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch_musa/__init__.py", line 43, in <module>
+    raise ImportError("Please try running Python from a different directory!") from err
+ImportError: Please try running Python from a different directory!
+```
+
+**解决方案**
+
+这是 **`torch_musa` 与当前 `torch` 版本不兼容**的典型表现,可能是安装过程中包的版本有误,请删除已安装的包后重新安装:
+
+```shell
+# 确保正确python环境
+conda activate vllm_musa
+
+# 删除
+pip3 uninstall torch torch_musa torchaudio torchvision
+```
+
+此处跳转安装torch_musa依赖:[安装 torch_musa依赖](#安装 torch_musa依赖)
+
+**注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次**
+
+# 安装 VLLM 与 VLLM-MUSA
+
+```shell
+# 确保正确python环境
+conda activate vllm_musa
+
+# 进入依赖包目录(已进入忽略)
+cd ${packagedirectory}
+
+pip3 install -r requirements.txt
+pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl
+pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl
+pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl
+```
+
+# 验证 VLLM-MUSA
+
+```shell
+python3 -c "from vllm_musa import _musa_custom_ops;_musa_custom_ops.decode_mla"
+
+# 正常输出如下:
+(vllm_musa) dev@localhost:/home$ python3 -c "from vllm_musa import _musa_custom_ops;_musa_custom_ops.decode_mla"
+Error in cpuinfo: prctl(PR_SVE_GET_VL) failed  # 此处正常
+
+```
+
+# 模型下载与管理
+
+## 支持的量化模型
+
+当前版本对齐 vllm 社区 v0.9.2,可以在 https://huggingface.co/ 
直接下载开源模型。国内源可在魔塔上进行下载:https://modelscope.cn/home
+
+对于量化模型,我们提供以下加速版模型:
+
+| 类别 | 模型名称 | 魔塔地址 | git克隆地址 |
+| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| DeepSeek | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
+| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git |
+| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git |
+| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git |
+| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git |
+| | gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | 
MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
+
+**PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换工具后期会开放
+
+## 下载模型
+
+> 📌 提示:模型较大(7B约11GB,30B约32GB),请预留足够磁盘空间。
+
+```shell
+sudo apt update
+
+sudo apt install git-lfs
+
+git lfs install
+
+cd ${WorkDir}
+
+# 克隆
+git clone https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git
+```
+
+# 启用性能模式
+
+界面右上角: 电源 -> 性能
+
+# 启动模型服务
+
+> **PS:启动命令在32G运行内存环境下运行**
+>
+> - 若在32G以下运行,需要自行调整上下文的长度
+>
+> - 若调整参数后仍然出现OOM情况,请点击[扩充虚拟内存](#扩充虚拟内存)
+
+## 清理缓存
+
+建议启动服务前先清除缓存:
+
+```shell
+export TRITON_CACHE_DIR="/tmp/triton" && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
+```
+
+## 通用启动命令
+
+> --num-gpu-blocks-override 1024 --max-model-len 16384 适用场景为单并发16k上下文
+
+```shell
+export TRITON_CACHE_DIR="/tmp/triton" && \
+sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \
+vllm serve 
gptq-DeepSeek-R1-Distill-Qwen-7B \ + --served_model_name gptq-DeepSeek-R1-Distill-Qwen-7B \ + -tp 1 \ + --gpu-memory-utilization 0.7 \ + --quantization gptq \ + --num-gpu-blocks-override 1024 \ + --max-model-len 16384 \ + --swap-space 0 \ + --block-size 32 + +# vllm serve gptq-DeepSeek-R1-Distill-Qwen-7Bq +# - gptq-DeepSeek-R1-Distill-Qwen-7B 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/gptq-DeepSeek-R1-Distill-Qwen-7B 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +## Qwen3-30B 启动命令 + +```shell +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +**默认使用 v0 engine, 想使用 v1 engine 需要指定 VLLM_USE_V1=1,如:** + +```shell +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +VLLM_USE_V1=1 \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +**取消思考模式** + +```shell +wget https://qwen.readthedocs.io/en/latest/_downloads/c101120b5bebcc2f12ec504fc93a965e/qwen3_nonthinking.jinja + +export TRITON_CACHE_DIR="/tmp/triton" && \ +sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" && \ +VLLM_USE_V1=1 \ +vllm serve Qwen3-30B-A3B-GPTQ-Int4 \ + -tp 1 \ + --gpu_memory_utilization 
0.7 \ + --quantization gptq \ + --max-model-len 16384 \ + --max-num-seqs 1 \ + --swap-space 0 \ + --num-gpu-blocks-override 512 \ + --enforce-eager \ + --block_size 32 \ + --chat-template ./qwen3_nonthinking.jinja + +# vllm serve Qwen3-30B-A3B-GPTQ-Int4 +# - Qwen3-30B-A3B-GPTQ-Int4 为模型的路径,需要在模型上级目录运行 +# - ${modelDir}/Qwen3-30B-A3B-GPTQ-Int4 可以在任意目录下执行 +# 不使用served_model_name指定模型id,那么模型id将依据【模型路径】来命名 +``` + +## VLLM参数说明 + +- **`model`**:模型路径 +- **`served_model_name`**:设置启动后模型的名称,默认使用model的路径命名 +- **`device`**:仅支持设置为`musa` +- **`tensor-parallel-size`**:目前仅支持tp=1 +- **`dtype`**: 支持默认值`auto,float16,bfloat16` +- **`kv-cache-dtype`**:仅支持默认值`auto` +- **`pipeline-parallel-size`**:仅支持默认值`1` +- **`max_num_batched_tokens`**,**`max_model_len`** :需要根据运行的序列长度进行配置,如果出现OOM可减小这两个参数值,仍然出现OOM情况,请点击[扩充虚拟内存](#扩充虚拟内存) +- **`enforce-eager`** : 表示立即执行,不启用musaGraph + +## 常见问题 + +### Q1:There appear to be 1 leaked semaphore objects to clean up at shutdow + +```python +INFO 07-31 10:06:10 executor_base.py:116] Maximum concurrency for 8192 tokens per request: 16.00x +Traceback (most recent call last): + File "/home/dev/miniconda3/envs/vllm_musa/bin/vllm", line 8, in + sys.exit(main()) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main + args.dispatch_function(args) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 34, in cmd + uvloop.run(run_server(args)) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run + return loop.run_until_complete(wrapper()) + File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper + return await main + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server + async with 
build_async_engine_client(args) as engine_client: + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/contextlib.py", line 199, in __aenter__ + return await anext(self.gen) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client + async with build_async_engine_client_from_engine_args( + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/contextlib.py", line 199, in __aenter__ + return await anext(self.gen) + File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args + raise RuntimeError( +RuntimeError: Engine process failed to start. See stack trace for the root cause. +(vllm_musa) dev@localhost:~/model$ /home/dev/miniconda3/envs/vllm_musa/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown + warnings.warn('resource_tracker: There appear to be %d ' +``` + +**解决方案** + +- 可尝试重启设备 + +- 尝试清除缓存: + + ```shell + export TRITON_CACHE_DIR="/tmp/triton" && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" + ``` + +- 尝试将以下配置的参数调低: + + ```shell + --num-gpu-blocks-override + --max-model-len + ``` + +- 扩充虚拟内存,请点击 [扩充虚拟内存](#扩充虚拟内存) + +# 模型服务调用与测试 + +## 查看模型列表 + +命令: + +```shell +curl http://localhost:8000/v1/models +``` + +输出: + +```json +{ + "object": "list", + "data": [ + { + "id": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "object": "model", + "created": 1755164590, + "owned_by": "vllm", + "root": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "parent": null, + "max_model_len": 16384, + "permission": [ + { + "id": "modelperm-d260b34830fe43b4a2b4dc2bff6adee5", + "object": "model_permission", + "created": 1755164590, + "allow_create_engine": false, + "allow_sampling": true, + "allow_logprobs": true, + "allow_search_indices": false, + "allow_view": true, + "allow_fine_tuning": 
false, + "organization": "*", + "group": null, + "is_blocking": false + } + ] + } + ] +} +``` + +## 发起对话请求 + +另开⼀个窗⼝调⽤,需要替换为本地的模型路径 + +```shell +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "temperature": 0.7, + "top_p": 0.8, + "top_k": 20, + "repetition_penalty":1.05, + "max_tokens": 1000, + "messages": [{"role": "user", "content": "介绍一下北京"}] + }' + +# 若运行时qwen3模型,该模型默认会进行思考,可在请求体中添加 "chat_template_kwargs": {"enable_thinking": false} 来取消思考 +# 结构如下: +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "某个Qwen3模型", + "temperature": 0.7, + "top_p": 0.8, + "top_k": 20, + "repetition_penalty":1.05, + "max_tokens": 1000, + "messages": [{"role": "user", "content": "介绍一下北京"}], + "chat_template_kwargs": {"enable_thinking": false} + }' +``` + +输出: + +```json +{ + "id": "chatcmpl-9b2d6b19177e46aea986aafa520bf2ee", + "object": "chat.completion", + "created": 1755164707, + "model": "gptq-DeepSeek-R1-Distill-Qwen-7B", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "reasoning_content": null, + "content": "嗯,我现在要介绍一下北京。首先,我得想想北京有哪些方面可以写。用户给的介绍挺详细的,有地理位置、历史、文化、景点、美食等等。我应该按照这个结构来,确保每个部分都涵盖到。\n\n先从地理位置开始,北京位于华北,地理位置很重要,因为它周围有很多自然景观和历史古迹。然后是历史,北京作为古都,有很多历史遗迹,比如故宫、天坛这些地方。还有历史人物,比如爱新觉罗·博克,他可能在1927年访问过北京,这显示了北京在历史上的重要性。\n\n接下来是文化,北京有很多特色,比如四合院、胡同,还有北京的 dialects 和 food。可能需要提到一些著名的小吃,比如炸酱面、烤鸭这些,这样读者能感受到当地的美食。\n\n然后是景点,我得列出一些著名景点,比如故宫、天坛、鸟巢、水立方这些。每个景点的特点要简单说明一下,让用户知道去那里有什么可以看、做。\n\n文化生活方面,北京有很多艺术展览,比如790艺术区,还有电影学院。体育方面,国家体育馆和奥林匹克公园都是亮点,应该提到它们的功能和意义。\n\n现代发展也不能少,北京的城市建设,比如 602地块,还有天安门广场的现代化升级。科技方面,比如 5G 网络和, beijing app,这些现代基础设施让北京更便利。\n\n美食方面,除了炸酱面和烤鸭,还有其他特色菜,比如烤鸭、涮羊肉这些,可以推荐一些餐厅,但要注意不要太详细,保持简洁。\n\n交通方面,地铁、公交和出租车都是主要的出行方式,再加上一些著名景点的路线,帮助游客规划行程。\n\n最后,总结一下北京的魅力,它不仅是一个古老的城市,也是一个充满活力和创新的地方,适合不同的人前来生活。\n\n现在,检查一下有没有遗漏的部分,比如 maybe 
提到一些其他的景点或者特色,或者更详细的描述。不过,保持每个部分简短,重点突出,避免过于冗长。\n\n可能还需要考虑一下,用户的需求是什么。他们可能想知道北京的历史文化、美食,还是现代发展?根据之前的介绍,已经涵盖了这些方面,所以我觉得结构已经很全面了。\n\n另外,语言风格要口语化,避免使用任何markdown格式,用简单的中文表达,让读者容易理解。\n\n总的来说,我需要按照地理位置、历史、文化、美食、景点、现代发展这几个方面来组织内容,每个部分简明扼要,突出重点,让介绍既全面又易于阅读。\n\n\n北京,这座历史悠久的城市,位于中国华北,地理环境优越,拥有丰富的历史文化和现代化发展。以下是关于北京的详细介绍:\n\n### 地理与历史\n北京位于华北平原,地理位置优越,东距渤海约100公里,北距Inner Mongolia自治区西南部约100公里。历史悠久的古都,曾是历史上 Activity 的政治、经济和文化中心。作为古都,北京拥有众多历史遗迹,如故宫、天坛、北海 and 天干支.\n\n### 文化与历史\n北京是中国历史文化名城,拥有众多历史遗迹,如故宫、天坛、北海 and 天干支. 历史人物如爱新觉罗·博克曾在此访问,显示其重要性. 北京是中华文明的重要发源地,孕育了无数文化名人.\n\n### 文化与美食\n北京以其独特的 dialect 和美食闻名,如炸酱面、烤鸭和涮羊肉. 拥有众多特色餐厅,适合品尝当地美食. 历史与现代结合,使其成为美食爱好者的天堂.\n\n### 景点与活动\n北京拥有众多著名景点,如故宫、鸟巢、水立方和奥林匹克公园. 它们不仅是旅游胜地,也是举办各种活动的理想场所. 国家体育场等大型设施展示了现代化城市的精神.\n\n### 现代发展\n北京在现代化进程中不断进步,拥有现代化的基础设施和先进的科技. 例如,602地块的混合式地块计划,提升了城市面貌. 作为国际交往的中心,北京在现代生活中扮演重要角色.\n\n### 经济与交通\n北京作为北方重要的交通枢纽,拥有发达的经济基础. 市区拥有6条主轴线,成为东西向的经济带. 交通系统完善,地铁、公交和出租车等多种出行方式方便游客.\n\n### 结语\n北京以其悠久的历史和现代化发展,展现出独特魅力. 无论是历史爱好者还是美食探索者,都能在北京市找到满足需求的地方. 无论是古代文化还是现代科技,北京都以其独特的方式吸引着每一位访客.", + "tool_calls": [] + }, + "logprobs": null, + "finish_reason": "stop", + "stop_reason": null + } + ], + "usage": { + "prompt_tokens": 9, + "total_tokens": 989, + "completion_tokens": 980, + "prompt_tokens_details": null + }, + "prompt_logprobs": null +} +``` + +## python调用 + +### 流式输出 + +```python +from openai import OpenAI + +# Modify OpenAI's API key and API base to use vLLM's API server. +openai_api_key = "EMPTY" +openai_api_base = "http://localhost:8000/v1" + +client = OpenAI( + # defaults to os.environ.get("OPENAI_API_KEY") + api_key=openai_api_key, + base_url=openai_api_base, +) + +models = client.models.list() +model = models.data[0].id + +chat_completion = client.chat.completions.create( + messages=[{ + "role": "system", + "content": "You are a helpful assistant." + }, { + "role": "user", + "content": "北京有哪些名胜古迹?" 
+    }],
+    model=model,
+    temperature=0.7,
+    top_p=0.8,
+    extra_body={
+        'top_k': 20,
+        # penalize repetition; vLLM does not pick this up from the model's generation_config by default, so set it here.
+        # See: https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct/file/view/master?fileName=generation_config.json&status=1#L9
+        'repetition_penalty': 1.05,
+        # "chat_template_kwargs": {"enable_thinking": False},  # disable thinking for Qwen3 models
+    },
+    max_tokens=512,
+    stream=True,  # enable streaming output
+)
+
+# consume the streamed response chunk by chunk
+print("Chat response (streaming):")
+for chunk in chat_completion:
+    if chunk.choices:
+        delta = chunk.choices[0].delta
+        content = delta.content
+        if content:
+            print(content, end='', flush=True)
+print("\n - Chat response (end) -\n")
+```
+
+### Non-streaming output
+
+```python
+from openai import OpenAI
+
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+
+client = OpenAI(
+    # defaults to os.environ.get("OPENAI_API_KEY")
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+
+models = client.models.list()
+model = models.data[0].id
+
+chat_completion = client.chat.completions.create(
+    messages=[{
+        "role": "system",
+        "content": "You are a helpful assistant."
+    }, {
+        "role": "user",
+        "content": "北京有哪些名胜古迹?"
+    }],
+    model=model,
+    temperature=0.7,
+    top_p=0.8,
+    extra_body={
+        'top_k': 20,
+        'repetition_penalty': 1.05,
+        # "chat_template_kwargs": {"enable_thinking": False},  # disable thinking for Qwen3 models
+    },
+    max_tokens=512,
+    stream=False,
+)
+
+print("Chat completion results:")
+print(chat_completion)
+```
+
+# Performance testing
+
+> Start the model service in one terminal and run the benchmark script in another. Performance mode is recommended while testing: [enable performance mode](#启用性能模式)
+
+## vLLM benchmark
+
+```shell
+# enter the working directory
+cd ${WorkDir}
+
+# make sure the correct python environment is active
+conda activate vllm_musa
+
+git clone https://github.com/vllm-project/vllm.git
+
+cd vllm
+
+git checkout v0.9.2
+
+cd benchmarks/
+
+# --model must match the id of the model being served
+# replace ${modelDir} with the directory where the model is stored
+python3 benchmark_serving.py \
+    --base-url http://127.0.0.1:8000 \
+    --model gptq-DeepSeek-R1-Distill-Qwen-7B \
+    --tokenizer ${modelDir}/models/gptq-DeepSeek-R1-Distill-Qwen-7B \
+    --dataset_name random \
+    --random_input_len 128 \
+    --random_output_len 128 \
+    --num-prompts 1 \
+    --trust-remote-code \
+    --ignore-eos
+```
+
+### FAQ
+
+#### Q1: `GLIBCXX_3.4.30' not found
+
+```shell
+Traceback (most recent call last):
+  File "/home/skysi/04-vllm-musa/vllm/benchmarks/benchmark_serving.py", line 37, in
+    from backend_request_func import (ASYNC_REQUEST_FUNCS,
+  File "/home/skysi/04-vllm-musa/vllm/benchmarks/backend_request_func.py", line 15, in
+    from transformers import (AutoTokenizer, PreTrainedTokenizer,
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/__init__.py", line 27, in
+    from . 
import dependency_versions_check
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 16, in
+    from .utils.versions import require_version, require_version_core
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/__init__.py", line 24, in
+    from .args_doc import (
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/args_doc.py", line 30, in
+    from .generic import ModelOutput
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/transformers/utils/generic.py", line 46, in
+    import torch  # noqa: F401
+  File "/home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 368, in
+    from torch._C import *  # noqa: F403
+ImportError: /home/skysi/miniconda3/envs/vllm_musa/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/skysi/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
+```
+
+**Fix**
+
+```shell
+# make sure the correct python environment is active
+conda activate vllm_musa
+
+conda install -c conda-forge libstdcxx-ng=12.1.0
+```
+
+## EvalScope benchmark
+
+> EvalScope is the **model evaluation and performance benchmarking framework** from the ModelScope community
+
+```shell
+# create a virtual environment
+conda create -n evalscope python=3.10 -y
+
+# activate the virtual environment
+conda activate evalscope
+
+# install EvalScope
+pip install -U 'evalscope[perf]' plotly gradio wandb
+
+# 10 requests with concurrency 1
+evalscope perf \
+  --url "http://localhost:8000/v1/chat/completions" \
+  --api-key "" \
+  --model gptq-DeepSeek-R1-Distill-Qwen-7B \
+  --number 10 \
+  --parallel 1 \
+  --api openai \
+  --dataset openqa \
+  --stream
+```
+
+# Expanding virtual memory
+
+PS: a 16GB box needs this step configured before a 7B model can run with an 8k context
+
+## Configure a swap partition
+
+```shell
+# turn off all existing swap areas first
+swapoff -a
+# create the file that will back the swap area; this step takes a while
+dd if=/dev/zero of=/var/swapfile bs=1G count=8
+# build the swap filesystem on the file
+mkswap /var/swapfile
+# enable the swap area
+swapon /var/swapfile
+# check the current memory layout; the swap total should now read 8xxx
+free -m
+# confirm the swap file is active
+sudo swapon --show
+# enable it at every boot via /etc/fstab
+echo '/var/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
+```
+
+## Configure the GPU memory page size
+
+```shell
+# 1. run:
+sudo su
+# 2. change the contents of /etc/modprobe.d/mtgpu.conf to:
+options mtgpu mtgpu_drm_major=2 GeneralSVMHeapPageSize=0x1000
+# 3. run the following commands:
+update-initramfs -u  # cryptsetup-related ERROR/WARNING output here is expected
+reboot
+```
diff --git "a/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md" "b/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
index 35a89fa82c833a051b4124d0ff95d7dc0cb8a152..2c8bd97f90d591923db1499ae779ea1ad4b8005d 100644
--- "a/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
+++ "b/thirdparty/skynoon/user_cases/T035EVB-MTSDK-1.3.0/T035-vllm-musa\350\277\233\350\241\214\345\244\247\346\250\241\345\236\213\346\216\250\347\220\206.md"
@@ -163,34 +163,7 @@ True
 
 # 常见问题
 
-## Q1:libmpi_cxx.so.40: cannot open shared object file: No such file or directory
-
-```shell
-(vllm_musa) dev@localhost:~/model$
-(vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())"
-Traceback (most recent call last):
- File "", line 1, in
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 236, in
- _load_global_deps()
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 195, in _load_global_deps
- raise err
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/site-packages/torch/__init__.py", line 176, in _load_global_deps
- ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
- File "/home/dev/miniconda3/envs/vllm_musa/lib/python3.10/ctypes/__init__.py", line 374, in __init__
- self._handle = _dlopen(self._name, mode)
-OSError: libmpi_cxx.so.40: cannot open shared 
object file: No such file or directory -``` - -**解决方案** - -证明mpi so缺失,请下载以下依赖: - -```shell -sudo apt update - -``` - -## Q2:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x +## Q1:NumPy 2.2.6 as it may crash. To support both 1.x and 2.x ```python A module that was compiled using NumPy 1.x cannot be run in @@ -226,7 +199,7 @@ conda activate vllm_musa pip3 install numpy==1.26.4 ``` -## Q3:ImportError: Please try running Python from a different directory! +## Q2:ImportError: Please try running Python from a different directory! ```python (vllm_musa) dev@localhost:~/model$ python3 -c "import torch;import torch_musa;print(torch.musa.is_available())" @@ -260,13 +233,13 @@ pip3 uninstall torch torch_musa torchaudio torchvision **注意:如果还不行,请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次** -## Q4:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory +## Q3:ImportError: libmccl.so.2: cannot open shared object file: No such file or directory **解决方案** 请重新安装一次musa环境,点击跳转:[环境安装](#envInstall),安装完成后再将torch相关包重新安装一次 -## Q5: MUSA driver initialization failed +## Q4: MUSA driver initialization failed ```shell Traceback (most recent call last): @@ -332,8 +305,8 @@ conda activate vllm_musa cd ${packagedirectory} pip3 install -r requirements.txt -pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl -pip3 install vllm-0.9.2.dev257+g4747b491f-cp310-cp310-linux_aarch64.whl +pip3 install triton-3.1.0-cp310-cp310-linux_aarch64.whl --force-reinstall +pip3 install vllm-0.9.2.dev259+gc2cd4356d-cp310-cp310-linux_aarch64.whl pip3 install vllm_musa-1.3+m1000-cp310-cp310-linux_aarch64.whl ``` @@ -356,13 +329,42 @@ Error in cpuinfo: prctl(PR_SVE_GET_VL) failed # 此处正常 对于量化模型,我们提供以下加速版模型: -| 模型名称 | 魔塔地址 | git克隆地址 | -| -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| gptq-DeepSeek-R1-Distill-Qwen-7B | 
https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
-| gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
-| gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
-| gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
-| Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| Category | Model | ModelScope URL | git clone URL |
+| ------------ | -------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| **DeepSeek** | gptq-DeepSeek-R1-Distill-Qwen-7B | https://modelscope.cn/models/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B | https://www.modelscope.cn/hiruyun/gptq-DeepSeek-R1-Distill-Qwen-7B.git |
+| | DeepSeek-R1-Distill-Qwen-1.5B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.git |
+| | DeepSeek-R1-0528-Qwen3-8B | https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.git |
+| **Qwen2.5** | Qwen2.5-7B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-14B-Instruct-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4.git |
+| | Qwen2.5-VL-3B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-3B-Instruct.git |
+| | Qwen2.5-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git |
+| | gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/models/hiruyun/gptq-Qwen2.5-7B-Instruct-v2 | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-7B-Instruct-v2.git |
+| | gptq-Qwen2.5-14B-Instruct | https://modelscope.cn/models/hiruyun/gptq-Qwen2.5-14B-Instruct | https://www.modelscope.cn/hiruyun/gptq-Qwen2.5-14B-Instruct.git |
+| | Qwen2.5-VL-7B-Instruct | https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct | https://www.modelscope.cn/Qwen/Qwen2.5-VL-7B-Instruct.git |
+| | QwQ-32B-GPTQ-Int4 | https://modelscope.cn/models/tclf90/qwq-32b-gptq-int4 | https://www.modelscope.cn/tclf90/qwq-32b-gptq-int4.git |
+| **Qwen3** | Qwen3-4B | https://modelscope.cn/models/Qwen/Qwen3-4B | https://www.modelscope.cn/Qwen/Qwen3-4B.git |
+| | Qwen3-8B | https://modelscope.cn/models/Qwen/Qwen3-8B | https://www.modelscope.cn/Qwen/Qwen3-8B.git |
+| | gptq-Qwen3-8B | https://www.modelscope.cn/models/hiruyun/gptq-Qwen3-8B | https://www.modelscope.cn/hiruyun/gptq-Qwen3-8B.git |
+| | Qwen3-30B-A3B-GPTQ-Int4 | https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4 | https://www.modelscope.cn/Qwen/Qwen3-30B-A3B-GPTQ-Int4.git |
+| | Qwen3-Embedding-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-0.6B.git |
+| | Qwen3-Embedding-4B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-4B.git |
+| | Qwen3-Embedding-8B | https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B | https://www.modelscope.cn/Qwen/Qwen3-Embedding-8B.git |
+| | Qwen3-Reranker-0.6B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-0.6B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-0.6B.git |
+| | Qwen3-Reranker-8B | https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B | https://www.modelscope.cn/Qwen/Qwen3-Reranker-8B.git |
+| **OpenBMB** | MiniCPM-V-4 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4.git |
+| | MiniCPM-V-4_5 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-4_5.git |
+| | MiniCPM4.1-8B | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B.git |
+| | MiniCPM4.1-8B-GPTQ | https://modelscope.cn/models/OpenBMB/MiniCPM4.1-8B-GPTQ | https://www.modelscope.cn/OpenBMB/MiniCPM4.1-8B-GPTQ.git |
+| | MiniCPM4-0.5B | https://modelscope.cn/models/OpenBMB/MiniCPM4-0.5B | https://www.modelscope.cn/OpenBMB/MiniCPM4-0.5B.git |
+| | BitCPM4-1B | https://modelscope.cn/models/OpenBMB/BitCPM4-1B | https://www.modelscope.cn/OpenBMB/BitCPM4-1B.git |
+| | BitCPM4-0.5B | https://modelscope.cn/models/OpenBMB/BitCPM4-0.5B | https://www.modelscope.cn/OpenBMB/BitCPM4-0.5B.git |
+| | MiniCPM-V-2_6 | https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6 | https://www.modelscope.cn/OpenBMB/MiniCPM-V-2_6.git |
+| **OpenGVLab** | InternVL3_5-8B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-8B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-8B.git |
+| | InternVL3_5-4B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-4B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-4B.git |
+| | InternVL3_5-2B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-2B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-2B.git |
+| | InternVL3_5-1B | https://modelscope.cn/models/OpenGVLab/InternVL3_5-1B | https://www.modelscope.cn/OpenGVLab/InternVL3_5-1B.git |
+| **InfiniAI** | Megrez-3b-Instruct | https://modelscope.cn/models/InfiniAI/Megrez-3b-Instruct | https://www.modelscope.cn/InfiniAI/Megrez-3b-Instruct.git |
 
 **PS**:如果对量化模型有速度要求,需要经过模型转换,模型转换⼯具后期会开放
 
@@ -1009,109 +1011,3 @@ 
update-initramfs -u # 会看到 cryptsetup 相关的 ERROR WARNING,是正常 reboot ``` -# 性能参考 - -> 以下是T035开发板在性能模式下,配备32GB运行内存性能测试结果。 - -## gptq-DeepSeek-R1-Distill-Qwen-7B - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.1439 | 0.0930 | 0.1011 | 51.8112 | 17 | 512 | 9.4239 | 9.5926 | -| 25% | 0.1570 | 0.0974 | 0.1020 | 54.4006 | 20 | 530 | 9.4354 | 9.7942 | -| 50% | 0.1604 | 0.1045 | 0.1028 | 61.1391 | 26 | 597 | 9.7261 | 10.1899 | -| 66% | 0.1658 | 0.1100 | 0.1034 | 70.4440 | 28 | 681 | 9.7425 | 10.2138 | -| 75% | 0.1666 | 0.1139 | 0.1058 | 102.6116 | 29 | 967 | 9.7646 | 10.2680 | -| 80% | 0.4193 | 0.1167 | 0.1060 | 108.1039 | 34 | 1020 | 9.8820 | 10.3675 | -| 90% | 0.4246 | 0.1233 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 95% | 0.4246 | 0.1281 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 98% | 0.4246 | 0.1338 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | -| 99% | 0.4246 | 0.1381 | 0.1151 | 235.7246 | 38 | 2048 | 10.0055 | 10.7530 | - -## gptq-Qwen2.5-7B-Instruct-v2 - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.3876 | 0.0838 | 0.0935 | 16.0308 | 41 | 169 | 10.2386 | 11.8103 | -| 25% | 0.3910 | 0.0876 | 0.0942 | 21.1318 | 44 | 215 | 10.3771 | 12.0867 | -| 50% | 0.4023 | 0.0918 | 0.0945 | 26.9652 | 
50 | 283 | 10.4653 | 12.7522 | -| 66% | 0.4243 | 0.0958 | 0.0945 | 27.1373 | 52 | 284 | 10.4950 | 12.9189 | -| 75% | 0.4272 | 0.1009 | 0.0950 | 27.8147 | 53 | 294 | 10.5140 | 13.2493 | -| 80% | 0.4291 | 0.1032 | 0.0953 | 35.3929 | 58 | 368 | 10.5422 | 13.5364 | -| 90% | 0.4336 | 0.1080 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 95% | 0.4336 | 0.1149 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 98% | 0.4336 | 0.1172 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | -| 99% | 0.4336 | 0.1183 | 0.0967 | 43.4659 | 62 | 457 | 10.5699 | 14.6113 | - -## gptq-Qwen3-8B - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.2059 | 0.1117 | 0.1357 | 163.6918 | 20 | 1202 | 6.9136 | 7.0131 | -| 25% | 0.2197 | 0.1248 | 0.1359 | 164.0903 | 23 | 1209 | 7.0599 | 7.2642 | -| 50% | 0.2248 | 0.1406 | 0.1390 | 191.3759 | 29 | 1375 | 7.3108 | 7.4726 | -| 66% | 0.2374 | 0.1503 | 0.1400 | 192.6090 | 31 | 1376 | 7.3431 | 7.4985 | -| 75% | 0.2386 | 0.1557 | 0.1414 | 200.7116 | 32 | 1417 | 7.3526 | 7.5299 | -| 80% | 0.4726 | 0.1579 | 0.1446 | 231.1392 | 37 | 1598 | 7.3642 | 7.5446 | -| 90% | 0.4842 | 0.1668 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 95% | 0.4842 | 0.1738 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 98% | 0.4842 | 0.1821 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | -| 99% | 0.4842 | 0.1872 | 0.1504 | 276.7050 | 41 | 1839 | 7.3679 | 7.5691 | - -## gptq-Qwen2.5-14B-Instruct - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | 
**总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.8191 | 0.1756 | 0.1882 | 44.1347 | 41 | 230 | 5.1026 | 5.7233 | -| 25% | 0.8244 | 0.1817 | 0.1889 | 44.2783 | 44 | 231 | 5.1032 | 5.7562 | -| 50% | 0.8650 | 0.1914 | 0.1919 | 62.6467 | 50 | 323 | 5.1758 | 5.9859 | -| 66% | 0.8704 | 0.1962 | 0.1935 | 74.4713 | 52 | 380 | 5.2113 | 6.1430 | -| 75% | 0.8746 | 0.1994 | 0.1942 | 76.0923 | 53 | 388 | 5.2170 | 6.3552 | -| 80% | 0.8829 | 0.2017 | 0.1943 | 76.6646 | 58 | 393 | 5.2316 | 6.5066 | -| 90% | 0.8848 | 0.2094 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 95% | 0.8848 | 0.2168 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 98% | 0.8848 | 0.2248 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | -| 99% | 0.8848 | 0.2305 | 0.1944 | 77.4029 | 62 | 395 | 5.2960 | 6.6161 | - -## Qwen3-30B-A3B-GPTQ-Int4 - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.4120 | 0.0712 | 0.0721 | 77.0906 | 20 | 1040 | 10.7279 | 11.0010 | -| 25% | 0.4657 | 0.0714 | 0.0727 | 85.1089 | 23 | 1160 | 13.4448 | 13.6327 | -| 50% | 0.5578 | 0.0722 | 0.0731 | 103.3427 | 29 | 1350 | 13.6296 | 13.9616 | -| 66% | 0.5616 | 0.0733 | 0.0734 | 106.3631 | 31 | 1369 | 13.6809 | 13.9950 | -| 75% | 0.7475 | 0.0747 | 0.0741 | 117.1708 | 32 | 1403 | 13.6890 | 14.0225 | -| 80% | 0.7949 | 0.0757 | 0.0927 | 122.4263 | 37 | 1456 | 13.7975 | 14.0497 | -| 90% | 0.8312 | 0.0840 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 95% | 0.8312 | 0.0982 
| 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 98% | 0.8312 | 0.1420 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | -| 99% | 0.8312 | 0.2052 | 0.1048 | 143.7252 | 41 | 1646 | 13.8157 | 14.2232 | - -## Qwen3-30B-A3B-GPTQ-Int4: v1 engine - -> EvalScope 压测 10请求1并发 - -| **百分位** | **首 token 时间 TTFT (s)** | **token 间延迟 ITL (s)** | **每 token 耗时 TPOT (s)** | **延迟 Latency (s)** | **输入 tokens** | **输出 tokens** | **输出吞吐量 (tok/s)** | **总吞吐量 (tok/s)** | -| ---------- | -------------------------- | ------------------------ | -------------------------- | -------------------- | --------------- | --------------- | ---------------------- | -------------------- | -| 10% | 0.3906 | 0.0699 | 0.0713 | 75.1009 | 20 | 1040 | 13.8222 | 14.1051 | -| 25% | 0.4237 | 0.0702 | 0.0714 | 83.3085 | 23 | 1160 | 13.8480 | 14.1867 | -| 50% | 0.5412 | 0.0707 | 0.0715 | 97.1227 | 29 | 1350 | 13.9020 | 14.2191 | -| 66% | 0.5843 | 0.0711 | 0.0716 | 98.4750 | 31 | 1369 | 13.9186 | 14.2378 | -| 75% | 0.7587 | 0.0714 | 0.0717 | 101.5034 | 32 | 1403 | 13.9241 | 14.2482 | -| 80% | 0.7930 | 0.0719 | 0.0718 | 104.6081 | 37 | 1456 | 13.9780 | 14.3940 | -| 90% | 0.8313 | 0.0742 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 95% | 0.8313 | 0.0786 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 98% | 0.8313 | 0.0821 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -| 99% | 0.8313 | 0.0834 | 0.0724 | 119.5024 | 41 | 1646 | 13.9812 | 14.4125 | -