
Titans + MIRAS: Helping AI have long-term memory

Google outlines MIRAS and Titans, a possible path toward continuously learning AI

The introduction of attention, a mechanism by which models prioritize relevant input data by looking back at earlier inputs, revolutionized sequence modeling. However, its computational cost grows quadratically with sequence length, which limits the ability to scale Transformer-based models to extremely long contexts, such as those required for full-document understanding or genomic analysis.
The research community has explored several alternatives, such as efficient linear recurrent neural networks (RNNs) and state space models (SSMs) like Mamba-2. These models offer fast, linear scaling by compressing context into a fixed-size state. However, this fixed-size compression cannot adequately capture the rich information in very long sequences.
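In schematic form, these models fold the entire history into a fixed-size state through a linear recurrence of roughly the following shape (a simplified, generic state-space update, not the exact Mamba-2 parameterization):

```latex
% The whole history is compressed into a fixed-size state h_t, whatever the sequence length.
h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t\, h_t
```

The state $h_t$ has the same size whether the model has read one page or a thousand, which is exactly the bottleneck described above.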

In two new papers, Titans and MIRAS, we present an architecture and a theoretical blueprint that combine the accuracy of transformers with the speed of RNNs. MIRAS is the theoretical framework, or blueprint, for generalizing these approaches, while Titans is the specific architecture (the instrument). Together, they advance the concept of test-time memorization: the ability of an AI model to maintain long-term memory by incorporating more powerful “surprise” metrics (signals of how unexpected a new piece of information is) while the model is running, without dedicated offline retraining.
The MIRAS framework, as demonstrated by Titans, introduces a meaningful shift toward real-time adaptation. Instead of compressing information into a static state, this architecture actively learns and updates its own parameters as data streams in. This crucial mechanism enables the model to incorporate new, specific details into its core knowledge instantly.

Titans: Quickly acquiring new context

An effective learning system requires distinct yet interconnected memory modules, mirroring the human brain’s separation of short-term and long-term memory.
Attention mechanisms excel at precise short-term memory. For long-term memory, Titans introduces a novel neural memory module that, in contrast to the fixed-size vector or matrix memory found in conventional RNNs, functions as a deep neural network (specifically, a multi-layer perceptron). Thanks to this memory module’s significantly higher expressive power, the model can summarize a lot of information without losing important context. The model isn’t simply taking notes; it’s understanding and synthesizing the entire story.
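As a rough illustration (a minimal sketch of the idea, not the published Titans implementation; the two-layer MLP, the dimensions, and the learning rate are assumptions), the long-term memory can be thought of as a small network whose parameters are nudged by gradient steps while the sequence is being read:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # key/value dimension (chosen for illustration)

# Long-term memory as a small MLP: it maps a key vector to a value vector.
W1 = rng.normal(scale=0.1, size=(D, D))
W2 = rng.normal(scale=0.1, size=(D, D))

def memory_read(k):
    """Recall: pass a key through the memory MLP to retrieve its associated value."""
    return W2 @ np.tanh(W1 @ k)

def memory_write(k, v, lr=0.01):
    """Test-time update: one gradient step on the recall error ||M(k) - v||^2."""
    global W1, W2
    h = np.tanh(W1 @ k)
    err = (W2 @ h) - v                                # how far current recall is from the new association
    grad_W2 = np.outer(err, h)
    grad_W1 = np.outer((W2.T @ err) * (1 - h**2), k)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# Streaming: each incoming (key, value) pair is written into memory on the fly.
for _ in range(5):
    k, v = rng.normal(size=D), rng.normal(size=D)
    memory_write(k, v)
```

Because the memory is a network rather than a single vector or matrix, it can store non-linear associations, which is what gives it the extra expressive power described above.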

Importantly, Titans does not simply store data passively. It actively learns how to recognize and retain important relationships and conceptual themes that connect tokens across the entire input. What we refer to as the “surprise metric” is an essential component of this ability. In human psychology, we know we quickly and easily forget routine, expected events but remember things that break the pattern — unexpected, surprising, or highly emotional events.

The “surprise metric” in Titans quantifies how different a new input is from what the model currently remembers:

  • Low surprise: If the new word is “cat” and the model’s memory state already expects an animal word, the gradient (surprise) is low. It can safely skip memorizing the word “cat” in its permanent long-term state.
  • High surprise: The gradient, or surprise, will be extremely high if the new input is a picture of a banana peel and the model’s memory state is summarizing a serious financial report. This indicates that the new input must be prioritized for permanent storage in the long-term memory module because it is significant or unusual.

The gradient, an internal error signal, serves as the model’s mathematical equivalent of the statement “This is unexpected and important!” This allows the Titans architecture to selectively update its long-term memory only with the most novel and context-breaking information, keeping the overall process fast and efficient.
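Concretely, if $\mathcal{M}_{t-1}$ is the current memory and $x_t$ is the incoming input, surprise is the gradient of the memory’s recall loss with respect to its parameters (the squared-error loss shown here is one common choice of objective, not the only one):

```latex
% Surprise = how strongly the new input x_t pulls on the current memory M_{t-1}.
\text{Surprise}_t = \nabla_{\mathcal{M}}\,\ell\bigl(\mathcal{M}_{t-1};\, x_t\bigr),
\qquad
\ell(\mathcal{M}; x_t) = \bigl\lVert \mathcal{M}(\mathbf{k}_t) - \mathbf{v}_t \bigr\rVert_2^2
```

where $\mathbf{k}_t$ and $\mathbf{v}_t$ are key and value projections of $x_t$. A small gradient means the memory already accounts for the input; a large gradient flags it for storage.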

Titans improves this mechanism with two essential components:

  • Momentum: The model takes into account both “momentary surprise” (the present input) and “past surprise” (the recent context flow). This ensures that relevant follow-up information is also recorded, even if those tokens are not themselves unexpected.
  • Forgetting: Titans uses an adaptive weight-decay mechanism to manage the memory’s limited capacity when dealing with extremely long sequences. This acts as a forgetting gate, allowing the model to discard information that is no longer needed.
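Putting the pieces together, the memory update takes roughly the following form (the gates below are data-dependent and learned; exact parameterizations vary across the Titans variants):

```latex
% S_t: surprise state with momentum;  M_t: long-term memory parameters.
\begin{aligned}
S_t           &= \eta_t\, S_{t-1} - \theta_t\, \nabla_{\mathcal{M}}\,\ell\bigl(\mathcal{M}_{t-1};\, x_t\bigr)
                 && \text{(past surprise + momentary surprise)} \\
\mathcal{M}_t &= (1 - \alpha_t)\,\mathcal{M}_{t-1} + S_t
                 && \text{(adaptive weight decay acts as the forgetting gate)}
\end{aligned}
```

Here $\eta_t$ carries past surprise forward (the momentum), $\theta_t$ scales the momentary surprise, and $\alpha_t \in [0, 1]$ is the forgetting gate: the closer it is to 1, the more of the old memory is discarded.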

MIRAS: A unified view of sequence modeling

From the most recent transformers to lightning-fast linear RNNs, every major advance in sequence modeling is essentially the same thing: an associative memory module. Accordingly, what makes MIRAS both unique and practical is the way it views AI modeling. It doesn’t see different architectures; rather, it sees different ways to solve the same problem: effectively combining new information with old memories without forgetting the important ideas.

MIRAS defines a sequence model through four key design choices:

  • Memory architecture: The structure that stores information (e.g., a vector, matrix, or a deep multi-layer perceptron, like in Titans).
  • Attentional bias: The internal learning objective the model optimizes, which determines what it prioritizes.
  • Retention gate: The regularizer for the memory. MIRAS reinterprets “forgetting mechanisms” as specific forms of regularization that balance new learning against retaining past knowledge.
  • Memory algorithm: The optimization algorithm used to update the memory.
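To make these four choices concrete, here is a minimal sketch of how they slot into a single streaming update (the specific choices below, a matrix memory, an MSE-style bias, a convex-combination retention gate, and a single gradient step, are illustrative placeholders rather than the configurations studied in the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # feature dimension (chosen for illustration)

# 1. Memory architecture: the simplest option, a single matrix mapping keys to values.
M = np.zeros((D, D))

# 2. Attentional bias: the internal objective the memory minimizes on each new (key, value) pair.
def attentional_bias(M, k, v):
    return 0.5 * np.sum((M @ k - v) ** 2)          # squared recall error (MSE-style bias)

# 3. Retention gate: blends the proposed update with the previous state, retaining a fraction alpha of the old memory.
def retention_gate(M_new, M_old, alpha=0.1):
    return (1 - alpha) * M_new + alpha * M_old

# 4. Memory algorithm: the optimizer used to update the memory, here a single gradient step.
def memory_algorithm(M, k, v, lr=0.1):
    grad = np.outer(M @ k - v, k)                  # gradient of the attentional bias w.r.t. M
    return M - lr * grad

# One streaming update per incoming token: learn the new association, then blend with the old memory.
for _ in range(3):
    k, v = rng.normal(size=D), rng.normal(size=D)
    print("recall error before write:", attentional_bias(M, k, v))
    M = retention_gate(memory_algorithm(M, k, v), M)
    print("recall error after write: ", attentional_bias(M, k, v))
```

Swapping any one of these four slots (a deeper memory, a more robust loss, a different regularizer, a momentum-based optimizer) yields a different point in the MIRAS design space; Titans corresponds to one particularly expressive set of choices.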

Transcending the mean squared error paradigm

Nearly all of today’s successful sequence models rely on mean squared error (MSE) or dot-product similarity to define their attentional bias and retention. This reliance can make models sensitive to outliers and limit their expressive power.

MIRAS transcends this limitation by providing a generative framework for exploring a richer design space, informed by the optimization and statistics literature. This makes it possible to develop novel architectures with alternative regularizers and non-Euclidean objectives.
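For example, the standard squared-error attentional bias can be swapped for a more outlier-robust objective; the two alternatives below (a generic $\ell_p$ objective and a Huber-style loss) are illustrative of the kind of substitutions this design space allows, not an exhaustive list from the paper:

```latex
% MSE attentional bias vs. two more robust alternatives (illustrative examples).
\ell_{\mathrm{MSE}}(\mathcal{M}; x_t)   = \bigl\lVert \mathcal{M}(\mathbf{k}_t) - \mathbf{v}_t \bigr\rVert_2^2,
\qquad
\ell_{p}(\mathcal{M}; x_t)              = \bigl\lVert \mathcal{M}(\mathbf{k}_t) - \mathbf{v}_t \bigr\rVert_p^p,
\qquad
\ell_{\mathrm{Huber}}(\mathcal{M}; x_t) = \textstyle\sum_i h_\delta\!\bigl(\mathcal{M}(\mathbf{k}_t)_i - v_{t,i}\bigr)
```

where $h_\delta$ is the Huber function, quadratic near zero and linear in the tails, so a single aberrant token cannot dominate the memory update.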