AudioLM

Built on SoundStream [1] and w2v-BERT [2], AudioLM [3] proposes a framework that consists of three components: a tokenizer model, a decoder-only Transformer language model, and a detokenizer model. SoundStream is a high-quality neural audio codec that converts input waveforms sampled at 16 kHz into discrete acoustic tokens, while w2v-BERT computes the semantic tokens.
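The data flow through the three components can be sketched as tokenize → model → detokenize. The sketch below uses toy stand-ins for the real neural networks; all class and method names are hypothetical, not the released API:

```python
from typing import List

# Toy stand-ins for the three AudioLM components; the real models are
# neural networks, these stubs only illustrate the data flow.
class SemanticTokenizer:
    """w2v-BERT role: waveform -> discrete semantic tokens."""
    def tokenize(self, wav: List[float]) -> List[int]:
        return [int(abs(x) * 100) % 100 for x in wav[::2]]

class NeuralCodec:
    """SoundStream role: encode to acoustic tokens, decode back to audio."""
    def encode(self, wav: List[float]) -> List[int]:
        return [int(abs(x) * 255) for x in wav]
    def decode(self, tokens: List[int]) -> List[float]:
        return [t / 255 for t in tokens]

class TokenLM:
    """Decoder-only Transformer role: autoregressively extend a token stream."""
    def generate(self, prompt: List[int], n_new: int) -> List[int]:
        # A real LM samples token by token; this stub just cycles the prompt.
        return (prompt * n_new)[:n_new]

def continue_audio(wav, semantic, codec, lm, n_new=8):
    sem = semantic.tokenize(wav)            # 1) semantic tokens (structure)
    ac = codec.encode(wav)                  # 2) acoustic tokens (detail)
    new_ac = lm.generate(sem + ac, n_new)   # 3) LM extends the token stream
    return codec.decode(new_ac)             # 4) detokenize back to audio

continuation = continue_audio([0.1, 0.2, 0.3, 0.4],
                              SemanticTokenizer(), NeuralCodec(), TokenLM())
```

The stubs make the contract explicit: only the tokenizer and detokenizer ever touch raw audio; the language model operates purely on discrete tokens.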

Figure 1. (Image source: AudioLM: a Language Modeling Approach to Audio Generation)

Figure 2. The three stages of the hierarchical modeling of semantic and acoustic tokens in AudioLM: i) semantic modeling for long-term structural coherence, ii) coarse acoustic modeling conditioned on the semantic tokens and iii) fine acoustic modeling. With the default configuration, for every semantic token there are 2Q′ acoustic tokens in the second stage and 2(Q − Q′) tokens in the third stage. The factor of 2 comes from the fact that the sampling rate of SoundStream embeddings is twice as that of the w2v-BERT embeddings. (Image source: AudioLM: a Language Modeling Approach to Audio Generation)

AudioLM is a pioneering framework that enables the generation of audio with a long-term coherent structure. It uniquely combines adversarial neural audio compression, self-supervised representation learning, and advanced language modeling techniques.