Step-Audio: Intelligent Voice Interaction in Multiple Languages and Styles

Step-Audio is an open-source framework for intelligent voice interaction. Its repository provides:

Basic Information

  • Multi-language documentation: README documents are provided in Chinese, English, and Japanese for the convenience of users of different languages.
  • Project links: Links to the technical report and to related Hugging Face models and datasets provide easy access to additional resources.

Main components and features

1. Core functions

Step-Audio is presented as the first production-ready open-source framework for intelligent voice interaction that unifies speech understanding and generation. Its features include:

  • Multilingual dialogue: supports conversations in Chinese, English, Japanese, and other languages.
  • Emotional tone: can express different emotions such as joy and sadness.
  • Regional dialects: supports dialects such as Cantonese and Sichuanese.
  • Speech-rate adjustment: the speaking speed can be controlled.
  • Prosodic styles: supports different vocal styles such as rap.

2. Key technological innovations

  • 130-billion-parameter multimodal model
    • A single unified model that integrates comprehension and generation, covering speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis.
    • The 130-billion-parameter Step-Audio-Chat variant has been open-sourced.
  • Generative data engine
    • Removes traditional text-to-speech (TTS) dependence on manually collected data by generating high-quality audio with the 130-billion-parameter multimodal model.
    • This generated data was used to train and release Step-Audio-TTS-3B, a resource-efficient model with enhanced instruction-following for controllable speech synthesis.
  • Fine-grained voice control
    • Instruction-based control enables precise regulation of a wide range of emotions (anger, joy, sadness, etc.), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming, etc.) to meet diverse voice generation needs.
  • Enhanced agent capabilities
    • ToolCall mechanism integration and role-playing enhancements improve agent performance on complex tasks.
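As a rough illustration of what instruction-based voice control means in practice, the sketch below composes a control prefix for the text to be synthesized. The tag names and prompt format here are hypothetical, invented for illustration; they are not Step-Audio's actual prompt schema.

```python
def build_tts_instruction(text, emotion=None, dialect=None, style=None, speed=None):
    """Prefix the target text with control tags a model could condition on.

    Hypothetical format: "(emotion: joy; dialect: Cantonese) <text>".
    """
    tags = []
    if emotion:
        tags.append(f"emotion: {emotion}")
    if dialect:
        tags.append(f"dialect: {dialect}")
    if style:
        tags.append(f"style: {style}")
    if speed:
        tags.append(f"speed: {speed}")
    prefix = f"({'; '.join(tags)}) " if tags else ""
    return prefix + text

print(build_tts_instruction("Nice to meet you!", emotion="joy", dialect="Cantonese"))
# (emotion: joy; dialect: Cantonese) Nice to meet you!
```

The point of such a scheme is that control attributes travel in-band with the text, so the same model can handle both plain and controlled synthesis requests.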

3. Model architecture

  • Dual-codebook tokenizer framework: audio streams are tokenized with two parallel codebooks, a semantic tokenizer (16.7 Hz, 1024-entry codebook) and an acoustic tokenizer (25 Hz, 4096-entry codebook), whose outputs are interleaved in time at a 2:3 ratio.
  • Language model: Step-1, a 130-billion-parameter pre-trained text-based large language model (LLM), undergoes continued audio pre-training so that Step-Audio can process speech information efficiently and align speech with text accurately.
  • Speech decoder: converts discrete speech tokens carrying semantic and acoustic information into continuous time-domain waveforms of natural speech. The decoder combines a flow-matching model with a mel-to-waveform vocoder, trained with the dual-code interleaving scheme to optimize the intelligibility and naturalness of the synthesized speech.
  • Real-time inference pipeline: an optimized pipeline whose core Controller module manages state transitions, coordinates speculative response generation, and keeps the key subsystems in sync. These subsystems include voice activity detection (VAD) for detecting user speech, a streaming audio tokenizer for real-time audio processing, the Step-Audio language model and speech decoder for generating responses, and a context manager for maintaining dialogue continuity.
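The 2:3 interleaving of the dual-codebook streams can be sketched as follows: the semantic stream (16.7 Hz) and the acoustic stream (25 Hz) cover the same time span at a 2:3 token ratio, so the merged sequence repeats a two-semantic, three-acoustic pattern. The token values and helper below are illustrative, not the repository's implementation.

```python
def interleave_2_3(semantic, acoustic):
    """Merge two token streams in a repeating [2 semantic, 3 acoustic] pattern.

    Assumes the streams span the same duration, i.e. roughly
    len(acoustic) == 3 * len(semantic) / 2 (the 16.7 Hz : 25 Hz ratio).
    """
    out = []
    s, a = 0, 0
    while s < len(semantic) or a < len(acoustic):
        out.extend(semantic[s:s + 2])  # two semantic tokens per window
        s += 2
        out.extend(acoustic[a:a + 3])  # three acoustic tokens per window
        a += 3
    return out

sem = ["S0", "S1", "S2", "S3"]                  # semantic stream
aco = ["A0", "A1", "A2", "A3", "A4", "A5"]      # acoustic stream
print(interleave_2_3(sem, aco))
# ['S0', 'S1', 'A0', 'A1', 'A2', 'S2', 'S3', 'A3', 'A4', 'A5']
```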
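A toy version of the Controller's state handling is sketched below, under the assumption that VAD events drive the transitions; the class, states, and method names are invented for illustration and are not the actual pipeline code.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()        # waiting for user speech
    LISTENING = auto()   # user is speaking
    RESPONDING = auto()  # generating/streaming a response

class Controller:
    """Toy state machine for the pipeline described above.

    VAD events drive the transitions; a speech onset while RESPONDING
    models barge-in, where a speculative response would be discarded.
    """
    def __init__(self):
        self.state = State.IDLE
        self.context = []  # stands in for the context manager's dialog history

    def on_vad(self, speech_detected: bool):
        if speech_detected:
            self.state = State.LISTENING   # includes barge-in while responding
        elif self.state is State.LISTENING:
            self.state = State.RESPONDING  # end of user turn: start generating

    def on_response_done(self, utterance: str):
        self.context.append(utterance)     # persist the turn for continuity
        self.state = State.IDLE

c = Controller()
c.on_vad(True)            # VAD: user starts speaking
c.on_vad(False)           # VAD: user stops -> generate a response
c.on_response_done("ok")  # response finished; back to idle
print(c.state)            # State.IDLE
```

A real controller would additionally run the streaming tokenizer, language model, and decoder concurrently; the sketch only shows how VAD events gate the turn-taking state.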

Repository structure

The repository contains the following main folders and files:

  • Dockerfile and Dockerfile-vllm: files used to build the Docker images.
  • README.md, README_CN.md, README_JP.md: project documentation in English, Chinese, and Japanese, covering the project description, model summary, and usage instructions.
  • requirements.txt and requirements-vllm.txt: dependency files listing the Python packages needed to run the project.
  • assets: project asset files such as images and PDF documents.
  • examples: example code and data.
  • funasr_detach: likely a detached copy of FunASR code for speech-related functions.
  • speakers: prompt audio files and speaker information for the available voices.
  • cosyvoice: likely CosyVoice-related code and resources for speech synthesis.

Model Download and Use

  • Model download: download links are provided on both Hugging Face and ModelScope for Step-Audio-Tokenizer, Step-Audio-Chat, and Step-Audio-TTS-3B.
  • Model use: the documentation lists the requirements for running each model, such as the minimum GPU memory needed.

The Step-Audio repository provides a comprehensive and powerful framework for intelligent voice interaction and is a valuable open-source project for researchers and developers alike.

