# SenseVoice.cpp **Repository Path**: kejiing/SenseVoice.cpp ## Basic Information - **Project Name**: SenseVoice.cpp - **Description**: 国内镜像SenseVoice.cpp,音频模型的端侧推理实现,类比llama.cpp - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-08-31 - **Last Updated**: 2024-09-12 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # SenseVoice.cpp [「简体中文」](./README.md)|「English」 [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)SenseVoice is an audio foundation model with audio understanding capabilities, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Acoustic Event Classification (AEC) or Acoustic Event Detection (AED). Currently, SenseVoice-small supports multilingual speech recognition, emotion recognition, and event detection capabilities in Mandarin, Cantonese, English, Japanese, and Korean, with extremely low inference latency. This project is based on the [ggml](https://github.com/ggerganov/ggml) framework. ## 1. Features 1. Based on ggml, it does not rely on other third-party libraries and is committed to edge deployment. 2. Feature extraction references the [kaldi-native-fbank](https://github.com/csukuangfj/kaldi-native-fbank) library, supporting multi-threaded feature extraction. 3. Flash attention decoding can be used (The speed has not improved 🤔 weird,need help). 4. Support Q3, Q4, Q5, Q6, Q8 quantization. ### 1.1 Future Plans 1. Support more backends. In theory, ggml supports the following backends, and future adaptations will be gradually made. Contributions are welcome. | Backend | Device | Supported | |--------------------------------------|----------------------|--------------------------------| | CPU | All | ✅ | | [Metal](./docs/build.md#metal-build) | Apple Silicon | (some bugs in im2col operator) | | [BLAS](./docs/build.md#blas-build) | All | ✅ | | [BLIS](./docs/backend/BLIS.md) | All | | | [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU | | | [MUSA](./docs/build.md#musa) | Moore Threads GPU | | | [CUDA](./docs/build.md#cuda) | Nvidia GPU | | | [hipBLAS](./docs/build.md#hipblas) | AMD GPU | | | [Vulkan](./docs/build.md#vulkan) | GPU | | | [Cann](./docs/build.md#vulkan) | Ascend NPU | | 2. Improve performance. 3. Fix bugs. ## 2. Usage ### Download Model or Convert Model You can download the model directly from the links below: [huggingface](https://huggingface.co/lovemefan/sense-voice-gguf) [modelscope](https://www.modelscope.cn/models/lovemefan/SenseVoiceGGUF) ```bash git lfs install git clone https://huggingface.co/lovemefan/sense-voice-gguf.git # 或从modelscope下载 git clone https://www.modelscope.cn/models/lovemefan/SenseVoiceGGUF.git ``` Alternatively, download the official model and convert it yourself: ```bash # Download the official model git lfs install git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git # Convert the model python scripts/convert-pt-to-gguf.py \ --model SenseVoiceSmall \ --output /path/to/export/gguf-fp32-sense-voice-small.bin \ --out_type f32 ``` ### RUN ```bash git clone https://github.com/lovemefan/SenseVoice.cpp cd SenseVoice.cpp git submodule sync && git submodule update --init --recursive mkdir build && cd build cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 8 # -t means thread num ./bin/sense-voice-main -m /path/gguf-fp16-sense-voice-small.bin /path/asr_example_zh.wav -t 4 -ng ``` ### Output Currently using the sense-voice-f16 model for output: ``` $./bin/sense-voice-main -m /data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice.bin /data/code/SenseVoice.cpp/scripts/resources/SenseVoiceSmall/example/asr_example_zh.wav -t 4 sense_voice_small_init_from_file_with_params_no_state: loading model from '/data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice-small.bin' sense_voice_model_load: version: 3 sense_voice_model_load: alignment: 32 sense_voice_model_load: data offset: 444480 sense_voice_model_load: loading model sense_voice_model_load: n_vocab = 25055 sense_voice_model_load: n_encoder_hidden_state = 512 sense_voice_model_load: n_encoder_linear_units = 2048 sense_voice_model_load: n_encoder_attention_heads = 4 sense_voice_model_load: n_encoder_layers = 50 sense_voice_model_load: n_mels = 80 sense_voice_model_load: ftype = 1 sense_voice_model_load: vocab[25055] loaded sense_voice_model_load: CPU total size = 468.98 MB sense_voice_model_load: n_tensors: 1197 sense_voice_model_load: load SenseVoiceSmall takes 0.213000 second sense_voice_init_state: compute buffer (encoder) = 50.40 MB sense_voice_init_state: compute buffer (decoder) = 13.72 MB system_info: n_threads = 4 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 main: processing audio (88747 samples, 5.54669 sec) , 4 threads, 1 processors, lang = auto... sense_voice_pcm_to_feature_with_state: calculate fbank and cmvn takes 7.207 ms <|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。 sense_voice_full_with_state: decoder audio use 1.011289 s, rtf is 0.182323. ``` ## Acknowledgements 1. This project borrows and mimics most of the C++ code from [whisper.cpp](https://github.com/ggerganov/ggml/blob/master/examples/whisper/whisper.cpp). 2. References the paraformer model structure and forward computation from [FunASR](https://github.com/alibaba-damo-academy/FunASR). 3. Feature extraction algorithm borrowed from [kaldi-native-fbank](https://github.com/csukuangfj/kaldi-native-fbank) and the lrf + cmvn algorithm in [FunASR](https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/onnxruntime/src/paraformer.cpp#L337C22-L372). 4. Utilizes a lot of preliminary work from [paraformer.cpp](https://github.com/lovemefan/paraformer.cpp), which will continue to be updated.