Google AudioPaLM: A large language model that can speak and listen

AudioPaLM provides a unified multimodal framework that can process and produce both speech and text.

Google AudioPaLM can generate text with your voice
New Delhi, UPDATED: Jun 26, 2023 17:21 IST

Highlights

  • Google has released AudioPaLM, a new language model
  • AudioPaLM integrates the best of PaLM-2 and AudioLM
  • It offers zero-shot speech-to-text translation, voice transfer across languages, and more

Google researchers have developed AudioPaLM, a large language model for speech generation and understanding tasks, to advance the field of audio generation and understanding.

At its core, AudioPaLM is built on a large-scale Transformer model. It extends the vocabulary of a pre-existing text-based LLM with specialised audio tokens. Together with a simple text description of the task, this allows a single decoder-only model to be trained on a variety of tasks that mix speech and text in different combinations. Examples of these tasks are speech recognition, text-to-speech synthesis, and speech-to-speech translation.
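The vocabulary extension described above can be illustrated with a minimal Python sketch. It is not Google's implementation: the vocabulary sizes, the task-description token IDs, and the helper names (`audio_to_model_id`, `is_audio_id`, `build_input`) are all assumptions made for illustration. The idea shown is only that audio tokens receive IDs placed after the existing text vocabulary, so one decoder-only model can read a task description followed by audio in a single sequence.

```python
# Hypothetical sizes, chosen for illustration only (not taken from AudioPaLM).
TEXT_VOCAB_SIZE = 32_000   # IDs 0..31999 belong to the original text vocabulary
NUM_AUDIO_TOKENS = 1_024   # IDs 32000..33023 are reserved for audio tokens


def audio_to_model_id(audio_token: int) -> int:
    """Shift a discrete audio token into the ID range appended after the text vocabulary."""
    if not 0 <= audio_token < NUM_AUDIO_TOKENS:
        raise ValueError(f"audio token out of range: {audio_token}")
    return TEXT_VOCAB_SIZE + audio_token


def is_audio_id(model_id: int) -> bool:
    """Return True if a model-level ID falls in the audio part of the vocabulary."""
    return TEXT_VOCAB_SIZE <= model_id < TEXT_VOCAB_SIZE + NUM_AUDIO_TOKENS


def build_input(task_text_ids: list[int], audio_tokens: list[int]) -> list[int]:
    """Concatenate the tokenized task description with the shifted audio tokens."""
    return list(task_text_ids) + [audio_to_model_id(t) for t in audio_tokens]


# Hypothetical text-token IDs standing in for a short task description;
# a real system would obtain these from its text tokenizer.
task_ids = [17, 905, 42]
audio = [0, 3, 1023]

sequence = build_input(task_ids, audio)
print(sequence)  # [17, 905, 42, 32000, 32003, 33023]
```

In a real model, the embedding table of the text LLM would presumably also need new rows for the added audio IDs; that step is omitted here.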

What is AudioPaLM and how does it work?

Google has introduced a new multimodal language model called AudioPaLM. It is designed to provide a unified multimodal architecture that can process and generate both text and speech.

AudioPaLM by Google

AudioPaLM combines the strengths of two existing models: PaLM-2, which was unveiled at Google I/O 2023, and AudioLM. This enables AudioPaLM to handle a range of applications, from speech recognition to speech-to-speech translation.

AudioPaLM: Performance & features

AudioPaLM has a wide range of applications, including speech recognition and speech-to-speech translation. It can recognise speech, translate speech while keeping the original speaker's voice, perform zero-shot speech-to-text translation for many languages, and transfer a voice between languages based on a short spoken prompt.

In addition, it can capture and reproduce distinct voices in several languages, enabling voice adaptation and conversion.

According to the press release, the model outperforms previous approaches in speech quality and voice preservation when performing speech-to-speech translation with voice transfer for speakers it has not encountered before.

Furthermore, AudioPaLM can produce transcripts either in the original language or directly as a translation, and it can generate speech from the original source. Based on automated and human evaluation, the system is reported to exceed existing systems in terms of voice quality.

Upcoming updates

According to the press release, audio tokenization offers further opportunities for research, with the goal of identifying desirable audio token properties, developing ways to measure them, and optimising for them. As present benchmarks mostly focus on speech recognition and translation, new, established benchmarks and metrics for generative audio tasks are also needed to advance research.

Published on: Jun 26, 2023 17:21 IST
Posted by: Samira Siddiqui
