Meta
Meta Chameleon
Meta Chameleon is a family of models that can combine text and images as input and output any combination of text and images with a single unified architecture for both encoding and decoding. While most current late-fusion models use diffusion-based learning, Meta Chameleon uses tokenization for text and images. This enables a more unified approach and makes the model easier to design, maintain, and scale. The possibilities are endless—imagine generating creative captions for images or using a mix of text prompts and images to create an entirely new scene.
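To make the token-based, early-fusion idea concrete, here is a minimal Python sketch of how text and image content can share a single token sequence. The tokenizers below are hypothetical stand-ins (the real model uses a learned subword vocabulary and a learned image codebook), so this shows the shape of the approach, not Chameleon's actual implementation.

```python
# Minimal sketch of early fusion via tokenization (illustrative only).
# `text_tokenizer` and `image_tokenizer` are hypothetical stand-ins for the
# discrete tokenizers a Chameleon-style model would use.

TEXT_VOCAB_SIZE = 32_000       # assumed text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192    # assumed image codebook size

def text_tokenizer(text: str) -> list[int]:
    # Stand-in: map characters to ids; a real model uses a subword tokenizer.
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def image_tokenizer(image_patches: list[float]) -> list[int]:
    # Stand-in: quantize patch values; a real model uses a learned codebook.
    return [int(p * (IMAGE_CODEBOOK_SIZE - 1)) for p in image_patches]

def build_unified_sequence(text: str, image_patches: list[float]) -> list[int]:
    # Image tokens are shifted into their own id range so one transformer
    # can model text and image tokens in a single shared sequence.
    text_ids = text_tokenizer(text)
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_tokenizer(image_patches)]
    return text_ids + image_ids   # one unified token stream

sequence = build_unified_sequence("A photo of a cat", [0.1, 0.5, 0.9])
print(sequence)
```

Because both modalities live in one token stream, the same architecture can encode and decode either, which is what makes the approach easier to design, maintain, and scale.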
Multi-Token Prediction
Using Multi-Token Prediction, we train language models to predict multiple future words at once, rather than the traditional one-at-a-time approach. This improves model capabilities and training efficiency while allowing for faster inference. In the spirit of responsible open science, we’re releasing the pre-trained models for code completion under a non-commercial/research-only license. We hope this enables the research community to investigate our method and the trained models’ behaviors independently.
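The core training idea can be sketched in a few lines of PyTorch: a shared trunk produces one representation per position, and several output heads are each trained to predict a token further into the future. This is an illustrative toy, not the released implementation; the sizes and module choices below are assumptions.

```python
# Toy sketch of multi-token prediction training (illustrative, not Meta's
# released code): head k is trained to predict the token k positions ahead.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, n_heads = 1000, 64, 4   # assumed toy sizes

trunk = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, d_model), nn.GELU())
heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_heads))

tokens = torch.randint(0, vocab_size, (2, 32))   # (batch, sequence) toy data
hidden = trunk(tokens)                           # shared representation

loss = 0.0
for k, head in enumerate(heads, start=1):
    logits = head(hidden[:, :-k])                # positions that have a k-step-ahead target
    targets = tokens[:, k:]                      # the token k steps in the future
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab_size),
                                  targets.reshape(-1))
print(loss / n_heads)
```

Each forward pass therefore supervises several future positions at once, which is where the training-efficiency and speed benefits come from.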
AudioSeal
AudioSeal is the first audio watermarking technique designed specifically for the localized detection of AI-generated speech, making it possible to pinpoint AI-generated segments within a longer audio snippet. AudioSeal revamps classical audio watermarking by focusing on the detection of AI-generated content rather than steganography. Unlike traditional methods that rely on complex decoding algorithms, AudioSeal’s localized detection approach allows for faster and more efficient detection.
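To illustrate what "localized" detection means in practice, the sketch below thresholds per-frame detector scores and converts them into time spans. It is a generic illustration of the concept, not the AudioSeal API; the frame length, threshold, and scores are made up.

```python
# Illustrative sketch of localized detection (not the AudioSeal API): a
# frame-level detector score is thresholded to pinpoint which spans of an
# audio clip are flagged as watermarked/AI-generated.
import numpy as np

def flagged_segments(frame_scores: np.ndarray, frame_ms: float = 20.0,
                     threshold: float = 0.5):
    """Return (start_ms, end_ms) spans whose detector score exceeds the threshold."""
    flagged = frame_scores > threshold
    segments, start = [], None
    for i, hit in enumerate(flagged):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:
        segments.append((start * frame_ms, len(flagged) * frame_ms))
    return segments

# Toy scores: the middle of the clip looks AI-generated to the detector.
scores = np.concatenate([np.full(50, 0.1), np.full(30, 0.9), np.full(20, 0.2)])
print(flagged_segments(scores))   # -> [(1000.0, 1600.0)]
```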
DIG In
DIG In evaluates potential geographical disparities in text-to-image models. In addition, to understand how people in different regions vary in their perceptions of geographic representation, we conducted a large-scale annotation study. We collected more than 65,000 annotations and more than 20 survey responses per example covering appeal, similarity, consistency, and shared recommendations for improved automatic and human evaluations of text-to-image models.
JASCO
JASCO is capable of accepting various conditioning inputs, such as specific chords or beats, to improve control over generated music outputs. Specifically, we apply information bottleneck layers in conjunction with temporal blurring to extract relevant information with respect to specific controls. This allows the incorporation of both symbolic and audio-based conditions in the same text-to-music generation model.
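The combination of temporal blurring and an information bottleneck can be sketched with a few PyTorch modules. The dimensions, kernel size, and module choices below are assumptions for illustration, not the released JASCO architecture.

```python
# Hedged sketch of a blurred-bottleneck conditioning path: a condition signal
# is temporally blurred with average pooling and squeezed through a narrow
# projection before being handed to the generator.
import torch
import torch.nn as nn

class BlurredBottleneck(nn.Module):
    def __init__(self, in_dim: int = 128, bottleneck_dim: int = 8, blur_kernel: int = 9):
        super().__init__()
        # Temporal blurring: average pooling over time removes fine detail,
        # keeping only coarse structure relevant to the control signal.
        self.blur = nn.AvgPool1d(kernel_size=blur_kernel, stride=1, padding=blur_kernel // 2)
        # Information bottleneck: a narrow projection limits how much of the
        # conditioning signal can pass through.
        self.down = nn.Conv1d(in_dim, bottleneck_dim, kernel_size=1)
        self.up = nn.Conv1d(bottleneck_dim, in_dim, kernel_size=1)

    def forward(self, condition: torch.Tensor) -> torch.Tensor:
        # condition: (batch, channels, time), e.g. a chord or beat feature track
        return self.up(self.down(self.blur(condition)))

cond = torch.randn(1, 128, 400)            # toy conditioning features
print(BlurredBottleneck()(cond).shape)     # torch.Size([1, 128, 400])
```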
IBM and NASA
Transformer-based language models
Transformer-based language models, which include BERT, RoBERTa, and IBM’s Slate and Granite family of models, are invaluable for a range of natural language understanding tasks. What powers these models is a statistical understanding of how language works. They are trained on masked language modeling tasks, in which the model learns by reconstructing sentences whose words have been obscured. Tokenizers, which break down words into units for the model, play a critical role in learning a vast vocabulary. While general-purpose text training is effective with popular tokenizers trained on datasets like Wikipedia or BooksCorpus, scientific domains require specialized tokenizers for terms like “phosphatidylcholine.”
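The tokenizer point is easy to see directly. The short Python sketch below uses a general-purpose Hugging Face tokenizer (roberta-base, assumed installed and downloadable) to show how an everyday word and a scientific term break into subword pieces; a tokenizer trained on scientific text would typically split the latter into fewer, more meaningful units.

```python
# Sketch of why domain-specific tokenizers matter: a general-purpose subword
# tokenizer fragments a scientific term into several pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for word in ["language", "phosphatidylcholine"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {len(pieces)} subword pieces: {pieces}")
```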
IBM-NASA
The IBM-NASA models, trained on domain-specific vocabulary, outperformed the open RoBERTa model by 5% on the popular BLURB benchmark, which evaluates performance on biomedical tasks. They also showed a 2.4% F1 score improvement on an internal scientific question-answering benchmark and a 5.5% improvement on internal Earth science entity recognition tests.
Retrieval augmented generation (RAG)
RAG commonly follows a two-step framework: a retriever model first encodes the question and retrieves relevant documents from a vector database. These documents are then passed to a generative model, which answers the question while ensuring fidelity to the retrieved documents.
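A minimal sketch of that two-step framework is shown below. The retriever checkpoint, the tiny in-memory document store, and the `generate` stub are assumptions for illustration, not a specific product's API.

```python
# Minimal retrieve-then-generate sketch of the RAG framework described above.
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed installed

retriever = SentenceTransformer("all-MiniLM-L6-v2")      # assumed retriever checkpoint

documents = [
    "The IBM-NASA models were trained on domain-specific scientific vocabulary.",
    "RAG passes retrieved documents to a generative model to ground its answer.",
]
doc_vectors = retriever.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    # Step 1: encode the question and score it against the stored document vectors.
    q = retriever.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q                 # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(-scores)[:k]]

def generate(prompt: str) -> str:
    # Placeholder for a real generative model call (e.g. an LLM API).
    return f"[LLM answer grounded in]\n{prompt}"

def answer(question: str) -> str:
    # Step 2: pass the retrieved context to the generative model.
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(answer("How does RAG keep answers faithful to the source documents?"))
```

Keeping the generator's prompt restricted to the retrieved context is what provides the fidelity the two-step framework is designed for.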