Most models today can understand text and images but not sound. In this project, we taught our model nano4M to also work with audio. We built a tool that turns sounds into tokens so the model can learn from them just like it does with words and pictures. Since there was no dataset with all three types of data, including audio, images, and text, we created our own by combining real audio with captions and AI-generated images. In the end, our model learned to connect these different kinds of data and guess missing parts using the others. This shows that combining sound, text, and images in one model is possible and opens the door to new and useful applications.
In this project, we worked on adding audio to our multimodal model, nano4M. Our goal was to turn audio signals into a sequence of tokens, just like how we do with text and images. To do this, we built an audio tokenizer using a model called VQ-VAE. This model helps us take raw audio and convert it into a shorter, discrete representation that a neural network can understand.
We tested different versions of the tokenizer: one using spectrograms, one using raw waveforms, and one with a special decoder called WaveNet to improve the sound quality.
To train nano4M on audio together with images and text, we needed a dataset that aligned all three modalities. Since no public dataset included audio, images, and captions simultaneously, we created our own multimodal dataset. We started with AudioCaps (which provides audio and text), and then generated synthetic images from the captions using a text-to-image model.
Our final system can learn from audio, images, and text together, and can even guess missing pieces from one modality using the others. This opens up many possibilities, like generating sounds from text or understanding audio with the help of images and captions.
As part of our effort to integrate audio into nano4M, we developed a tokenizer based on a Vector Quantized Variational Autoencoder (VQ-VAE). The goal was to obtain discrete audio representations that can be used for multimodal learning alongside text and images.
We experimented with three different audio tokenization architectures: (1) VQ-VAE using mel spectrogram input and Griffin-Lim decoding, (2) VQ-VAE trained directly on raw waveform, and (3) VQ-VAE combined with a WaveNet decoder.
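The step shared by all three variants is the vector-quantization layer that maps continuous encoder outputs to discrete token indices. Below is a minimal sketch of such a quantizer in PyTorch; the codebook size, latent dimension, and commitment weight are illustrative, not our exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps encoder outputs to their nearest codebook entries."""

    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z_e):
        # z_e: (B, T, D) continuous encoder latents
        flat = z_e.reshape(-1, z_e.size(-1))                 # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)      # (B*T, num_codes)
        tokens = dists.argmin(dim=-1).view(z_e.shape[:-1])   # discrete audio tokens (B, T)
        z_q = self.codebook(tokens)                          # quantized latents (B, T, D)
        # Codebook loss + commitment loss (standard VQ-VAE objective)
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, tokens, vq_loss
```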
MelSpectrogram/Raw Waveform + L1/STFT Loss
Raw Waveform + WaveNet
We began by training a VQ-VAE on LibriSpeech (100h), using mel spectrograms as input features. While reconstruction losses appeared low, the resulting waveforms reconstructed using Griffin-Lim were highly distorted.
Original
Reconstructed (Griffin-Lim)
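For reference, here is a minimal sketch of the spectrogram path, assuming torchaudio transforms and illustrative STFT settings rather than our exact configuration. Because the mel representation discards phase, Griffin-Lim can only estimate it, which accounts for much of the distortion heard above.

```python
import torchaudio

sample_rate, n_fft, hop, n_mels = 16000, 1024, 256, 80   # illustrative settings

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)

waveform, sr = torchaudio.load("clip.wav")   # placeholder path
mel = to_mel(waveform)                        # features fed to the VQ-VAE
# ... encode / quantize / decode the mel spectrogram here ...
recon = griffin_lim(inv_mel(mel))             # phase is estimated, hence artifacts
```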
We applied L1 loss between original and reconstructed waveforms. Despite improvements, the L1 loss failed to align with perceived audio quality. Adding a Short-Time Fourier Transform (STFT) loss helped, but did not fully resolve the issue.
Original
Reconstructed (VQ-VAE raw waveform)
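A minimal sketch of the combined objective used here, assuming batched mono waveforms and a single STFT resolution (multi-resolution variants repeat the spectral term over several n_fft/hop settings); the names and weight are illustrative:

```python
import torch
import torch.nn.functional as F

def waveform_loss(recon, target, n_fft=1024, hop=256, stft_weight=1.0):
    """L1 on raw samples plus an L1 penalty on STFT magnitudes."""
    l1 = F.l1_loss(recon, target)
    window = torch.hann_window(n_fft, device=recon.device)
    spec_r = torch.stft(recon, n_fft, hop_length=hop, window=window,
                        return_complex=True).abs()
    spec_t = torch.stft(target, n_fft, hop_length=hop, window=window,
                        return_complex=True).abs()
    return l1 + stft_weight * F.l1_loss(spec_r, spec_t)
```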
To achieve higher-quality reconstructions, we replaced the VQ-VAE decoder with a WaveNet conditioned on the quantized latents \(z_q(x)\). This autoregressive decoder models the waveform as a product of conditional distributions:
$$ p(x|h) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}, h), \quad \text{where } h = z_q(x) $$

WaveNet predicts the parameters of a mixture of logistic distributions at each timestep:
$$
\begin{align}
p(x_t) &= \sum_{i=1}^{K} \pi_i \cdot \text{Logistic}(x_t \mid \mu_i, \sigma_i) \\
\text{Loss}_{t} &= -\log p(x_{t}) \\
\text{Loss}_{\text{total}} &= \sum_{t=1}^{T} \text{Loss}_{t}
\end{align}
$$
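To make this objective concrete, here is a minimal sketch of the per-timestep mixture-of-logistics negative log-likelihood, assuming the decoder outputs unnormalized mixture weights, locations, and log-scales for K components (names and shapes are illustrative, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def mixture_of_logistics_nll(x, logit_pi, mu, log_s):
    """NLL of waveform samples x under a K-component logistic mixture.

    x:        (B, T)    target samples
    logit_pi: (B, T, K) unnormalized mixture weights
    mu:       (B, T, K) component locations
    log_s:    (B, T, K) component log-scales
    """
    x = x.unsqueeze(-1)                        # broadcast over the K components
    z = (x - mu) * torch.exp(-log_s)           # standardized value
    # log density of a logistic: -z - log(s) - 2*softplus(-z)
    log_prob = -z - log_s - 2.0 * F.softplus(-z)
    log_pi = F.log_softmax(logit_pi, dim=-1)   # normalize mixture weights
    # log p(x_t) = logsumexp_i( log pi_i + log Logistic(x_t | mu_i, s_i) )
    log_p = torch.logsumexp(log_pi + log_prob, dim=-1)   # (B, T)
    return -log_p.sum(dim=-1).mean()           # sum over time, mean over batch
```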
Initial training on the 100h subset was unstable, so we introduced several stabilization measures. These changes partially stabilized learning; however, audio quality remained limited, likely due to insufficient latent expressiveness.
Original
Reconstructed (VQ-VAE raw waveform)
Our experiments showed that high-quality waveform reconstruction from discrete tokens is challenging. L1 and STFT losses are insufficient alone; autoregressive models like WaveNet help, but depend heavily on latent quality and training stability.
Future work includes exploring HiFi-GAN or WaveRNN as decoders, and adding perceptual losses for better alignment with human judgments.
A key challenge in our work was finding a suitable multimodal dataset containing aligned audio, images, and text captions. Surprisingly, we did not find any publicly available dataset covering all three modalities. We ended up settling on the AudioCaps dataset, which contains audio samples sourced from YouTube videos along with human-written captions describing the audio.
While AudioCaps provides audio-caption pairs, it lacks corresponding aligned images. We considered extracting frames from the source YouTube videos, but this would violate YouTube's Terms of Service and risk account termination. Instead, we generated synthetic images using distilled Stable Diffusion inference conditioned on the text captions (a sketch of this step follows the example images below).
Caption: "Rain falling and thunder roaring"
Caption: "Food frying with person narrating"
Caption: "Multiple adults speaking, and a child shouting in the background"
While our approach provided a workaround for the lack of an available dataset, it has some limitations. The synthetic images do not perfectly match the audio content and are sometimes unrecognizable. Furthermore, the AudioCaps dataset consists mostly of environmental sounds, which, due to their uniqueness, may be hard for a small model to learn and generate.
Once the dataset was fully tokenized, we trained our model using the nano4M architecture, which is based on 4M. During training, we intentionally hide a subset of the input tokens across modalities and ask the model to predict them from the available context.
The diagram below illustrates this process. On the left, each input modality—image, audio, and text—is tokenized separately. Then, a fixed number of tokens are randomly selected as inputs to the Transformer encoder, while the remaining ones are treated as targets to be predicted by the decoder. This masked pre-training encourages the model to reason across modalities.
This training strategy—visualized above—forces the model to reason across modalities to recover missing pieces. Whether it’s a missing sound, word, or image patch, the model must infer it using cues from the other modalities.
By repeating this process many times, the model learns to connect and understand how text, images, and audio relate to each other. This helps it build a shared representation of all three, and allows it to do things like generate sound from text, or use audio to help interpret an image.
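A simplified sketch of this random input/target split; per-modality token budgets, masking schedules, and the Transformer itself are omitted, and all names are illustrative:

```python
import torch

def sample_input_target_split(tokens, num_input_tokens=256):
    """Randomly choose visible tokens for the encoder; the rest become decoder targets.

    tokens: (B, N) token ids concatenated across image, audio, and text.
    """
    B, N = tokens.shape
    perm = torch.rand(B, N).argsort(dim=1)       # a random permutation per sample
    input_idx = perm[:, :num_input_tokens]        # visible (encoder input) positions
    target_idx = perm[:, num_input_tokens:]       # masked positions to predict
    enc_inputs = torch.gather(tokens, 1, input_idx)
    dec_targets = torch.gather(tokens, 1, target_idx)
    return enc_inputs, input_idx, dec_targets, target_idx
```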
As a result, we obtained a lightweight but versatile model that can perform basic reasoning across modalities. However, our early experiments did not yield significant results. The synthetic images generated from captions lacked sufficient visual clarity, which limited the model's ability to learn meaningful cross-modal representations. Additionally, the audio modality introduced a critical challenge: token granularity. While text and image data can be effectively represented within 256 tokens, this token budget captures only a small fraction of the audio signal; for instance, 5 seconds of audio at 48 kHz tokenized with EnCodec typically yields approximately 3,000 tokens. Consequently, only a small segment of each audio clip is seen during training, which leads the model to generate short bursts of noise rather than coherent audio sequences.
Below are some of the results we obtained during our evaluation:
"Papericking, then"
"Someone crumples paper."
To improve the performance of the baseline model, we experimented with two modifications: the Muon optimizer and Rotary Positional Embeddings.
Muon is an optimizer designed to improve convergence speed and is used alongside AdamW: it updates only parameters that are 2D or higher, while all other parameters are optimized by AdamW. We used the implementation from Keller Jordan.
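A sketch of how this parameter split can be wired up; the `muon` import and constructor arguments follow Keller Jordan's reference implementation but may differ between versions, and the learning rates are illustrative:

```python
import torch
from muon import Muon   # Keller Jordan's implementation; API may vary by version

def build_optimizers(model, muon_lr=0.02, adamw_lr=3e-4, weight_decay=0.01):
    # Muon handles matrix-shaped (>= 2D) parameters; everything else goes to AdamW.
    # (Full setups often also route embeddings/output heads to AdamW, which this
    # simple ndim-based split does not do.)
    muon_params = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    adamw_params = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return (Muon(muon_params, lr=muon_lr, momentum=0.95),
            torch.optim.AdamW(adamw_params, lr=adamw_lr, weight_decay=weight_decay))
```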
We trained nano4M with Muon on the CLEVR multimodal dataset and achieved a validation loss of 3.51, slightly outperforming the baseline AdamW model, which reached a validation loss of 3.53.
Rotary Positional Embeddings (RoPE) encode both absolute and relative positional information by applying a rotation to the query and key vectors in the attention blocks. They have been shown to improve a model's ability to capture long-range dependencies and relative positions in sequences. We used the implementation from torchtune.
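A minimal usage sketch of torchtune's 1D RoPE module; the shapes and hyperparameters are illustrative, not our exact settings:

```python
import torch
from torchtune.modules import RotaryPositionalEmbeddings

batch, seq_len, n_heads, head_dim = 2, 256, 8, 64
rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=1024)

# torchtune expects tensors shaped [batch, seq_len, num_heads, head_dim]
q = torch.randn(batch, seq_len, n_heads, head_dim)
k = torch.randn(batch, seq_len, n_heads, head_dim)
q, k = rope(q), rope(k)   # rotations encode positions directly in q and k
```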
We trained nano4M with RoPE1D and RoPE2D on the CLEVR multimodal dataset. The RoPE1D model achieved a validation loss of 4.29, significantly worse than the baseline sine-cosine positional embeddings (3.53), likely because the dataset consists mostly of 2D modalities. The RoPE2D model achieved a validation loss of 3.63, slightly worse than the baseline, likely because the sequence length (256) is too short to fully benefit from RoPE.
In this work, we explored the feasibility of integrating an audio modality into nano4M, a lightweight multimodal architecture, using a fully synthetic dataset constructed from AudioCaps. Our objective was to enable joint training across text, image, and audio modalities within a unified framework.
Throughout our experiments, we encountered several key challenges: reconstructing high-quality audio from discrete tokens, building an aligned audio-image-text dataset from synthetic images of limited fidelity, and fitting the high token granularity of audio within the model's 256-token budget.
Overall, our results highlight both the potential and the challenges of extending lightweight multimodal architectures to include audio. This initial study lays the groundwork for more scalable and expressive models that can fully exploit audio as a first-class modality.