TTS | Loong's Lens

AudioLM

Built on SoundStream [1] and w2v-BERT [2], AudioLM [3] proposes a framework that consists of three components: a tokenizer model, a decoder-only Transformer language model, and a detokenizer model. SoundStream, a neural audio codec with strong performance, converts input waveforms at 16 kHz into embeddings, while w2v-BERT computes the semantic tokens.

Figure 1. (Image source: AudioLM: a Language Modeling Approach to Audio Generation)

Figure 2. The three stages of the hierarchical modeling of semantic and acoustic tokens in AudioLM: i) semantic modeling for long-term structural coherence, ii) coarse acoustic modeling conditioned on the semantic tokens and iii) fine acoustic modeling....
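To make the data flow between the three stages in the Figure 2 caption concrete, here is a minimal Python sketch. It is not AudioLM's implementation: every name, vocabulary size, and layer count is a hypothetical placeholder, the Transformer is replaced by a uniform random sampler, and it assumes one acoustic frame per semantic token for simplicity. It also assumes that the fine stage is conditioned on the coarse acoustic tokens.

```python
import random
from typing import Dict, List

# All constants are illustrative assumptions, not values taken from the post.
SEMANTIC_VOCAB = 1024  # assumed semantic token vocabulary size
ACOUSTIC_VOCAB = 1024  # assumed codebook size per quantizer layer
COARSE_LAYERS = 4      # assumed number of coarse quantizer layers
FINE_LAYERS = 8        # assumed number of fine quantizer layers


def sample_next(context: List[int], vocab: int, rng: random.Random) -> int:
    """Stand-in for one decoder-only Transformer step.

    A real model would score the next token given `context`;
    this placeholder ignores the context and samples uniformly.
    """
    return rng.randrange(vocab)


def generate(prefix: List[int], n_new: int, vocab: int,
             rng: random.Random) -> List[int]:
    """Autoregressively extend `prefix` and return only the new tokens."""
    seq = list(prefix)
    for _ in range(n_new):
        seq.append(sample_next(seq, vocab, rng))
    return seq[len(prefix):]


def audiolm_sketch(semantic_prompt: List[int], n_new_frames: int,
                   seed: int = 0) -> Dict[str, List[int]]:
    """Run the three stages in order, each conditioned on the previous one."""
    rng = random.Random(seed)

    # Stage i: continue the semantic tokens (long-term structure).
    semantic = list(semantic_prompt) + generate(
        semantic_prompt, n_new_frames, SEMANTIC_VOCAB, rng)

    # Stage ii: coarse acoustic tokens, with the semantic tokens as prefix.
    coarse = generate(semantic, len(semantic) * COARSE_LAYERS,
                      ACOUSTIC_VOCAB, rng)

    # Stage iii: fine acoustic tokens, with the coarse tokens as prefix.
    fine = generate(coarse, len(semantic) * FINE_LAYERS,
                    ACOUSTIC_VOCAB, rng)

    return {"semantic": semantic, "coarse": coarse, "fine": fine}


if __name__ == "__main__":
    tokens = audiolm_sketch(semantic_prompt=[1, 2, 3], n_new_frames=5)
    print({name: len(seq) for name, seq in tokens.items()})
```

In a full system, the semantic prompt would come from the tokenizer (w2v-BERT in the description above), and the coarse and fine acoustic tokens would be passed to the detokenizer to reconstruct a waveform; both steps are left out of this sketch.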