AudioLM

Built on SoundStream [1] and w2v-BERT [2], AudioLM [3] proposes a framework that consists of three components: a tokenizer model, a decoder-only Transformer language model, and a detokenizer model. SoundStream is a high-quality neural audio codec that converts input waveforms sampled at 16 kHz into discrete acoustic tokens, while w2v-BERT computes the semantic tokens.
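The data flow through the three components can be sketched as tokenize → model → detokenize. The sketch below uses toy stand-ins for the real neural networks; all class and method names are hypothetical, not the released API:

```python
from typing import List

# Toy stand-ins for the three AudioLM components; the real models are
# neural networks, these stubs only illustrate the data flow.
class SemanticTokenizer:
    """w2v-BERT role: waveform -> discrete semantic tokens."""
    def tokenize(self, wav: List[float]) -> List[int]:
        return [int(abs(x) * 100) % 100 for x in wav[::2]]

class NeuralCodec:
    """SoundStream role: encode to acoustic tokens, decode back to audio."""
    def encode(self, wav: List[float]) -> List[int]:
        return [int(abs(x) * 255) for x in wav]
    def decode(self, tokens: List[int]) -> List[float]:
        return [t / 255 for t in tokens]

class TokenLM:
    """Decoder-only Transformer role: autoregressively extend a token stream."""
    def generate(self, prompt: List[int], n_new: int) -> List[int]:
        # A real LM samples token by token; this stub just cycles the prompt.
        return (prompt * n_new)[:n_new]

def continue_audio(wav, semantic, codec, lm, n_new=8):
    sem = semantic.tokenize(wav)            # 1) semantic tokens (structure)
    ac = codec.encode(wav)                  # 2) acoustic tokens (detail)
    new_ac = lm.generate(sem + ac, n_new)   # 3) LM extends the token stream
    return codec.decode(new_ac)             # 4) detokenize back to audio

continuation = continue_audio([0.1, 0.2, 0.3, 0.4],
                              SemanticTokenizer(), NeuralCodec(), TokenLM())
```

The stubs make the contract explicit: only the tokenizer and detokenizer ever touch raw audio; the language model operates purely on discrete tokens.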

Figure 1. (Image source: AudioLM: a Language Modeling Approach to Audio Generation)

Figure 2. The three stages of the hierarchical modeling of semantic and acoustic tokens in AudioLM: i) semantic modeling for long-term structural coherence, ii) coarse acoustic modeling conditioned on the semantic tokens and iii) fine acoustic modeling. With the default configuration, for every semantic token there are 2Q′ acoustic tokens in the second stage and 2(Q − Q′) tokens in the third stage. The factor of 2 comes from the fact that the sampling rate of SoundStream embeddings is twice as that of the w2v-BERT embeddings. (Image source: AudioLM: a Language Modeling Approach to Audio Generation)

AudioLM is a pioneering framework that enables the generation of audio with a long-term coherent structure. It uniquely combines adversarial neural audio compression, self-supervised representation learning, and advanced language modeling techniques.