Artificial intelligence (AI) chatbots are everywhere these days, helping us with everything from customer service queries to recommendations for our next movie binge. But one of the biggest challenges these chatbots face is holding long, uninterrupted conversations without crashing or slowing down. A team of researchers at the Massachusetts Institute of Technology (MIT) and elsewhere has now developed a solution that keeps these AI chatbots chatting all day long without a hitch.
The team’s method, named StreamingLLM, involves a simple yet effective tweak to the key-value cache at the core of many large language models. This key-value cache acts like a conversation memory, storing recent tokens (data representations, such as words) for later use. When the cache fills beyond its capacity, the usual approach is to bump out the earliest tokens to make room, which can cause the model to fail.
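As a rough illustration of that baseline behavior (a minimal sketch, not the team’s implementation; the class name, `deque` layout, and `max_size` parameter are assumptions), a key-value cache that evicts its oldest entries when full might look like this:

```python
from collections import deque

class NaiveKVCache:
    """Toy key-value cache that drops the oldest token's entry when full."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        # One entry per token: the token plus the key/value vectors computed for it.
        self.entries = deque()

    def add(self, token, key, value):
        if len(self.entries) >= self.max_size:
            # Conventional eviction: drop the earliest entry to make room.
            # Losing those first tokens is what degrades generation quality.
            self.entries.popleft()
        self.entries.append((token, key, value))
```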
Instead, the researchers discovered that by keeping those initial tokens in memory, the chatbot can continue the conversation regardless of its length. This approach keeps a model efficient even when a conversation stretches past 4 million words. Compared with other methods that avoid crashing by continuously recomputing parts of past conversations, StreamingLLM performed more than 22 times faster.
Understanding StreamingLLM
The large language models that power chatbots encode data, such as the words in a user’s query, into representations called tokens. These tokens are then used to generate new text via an attention mechanism, which builds an attention map: a kind of grid that lays out how strongly each token (or word) relates to every other token. It is this map of relationships that lets large language models generate human-like text.
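To make the idea concrete, here is a minimal sketch of such an attention map in NumPy; the dimensions, random vectors, and function name are illustrative, and causal masking is omitted:

```python
import numpy as np

def attention_map(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Return an (n_tokens x n_tokens) grid of attention weights.

    Row i describes how strongly token i attends to every token; each
    row is a softmax, so its weights sum to 1.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Five tokens with 8-dimensional query/key vectors -> a 5 x 5 grid.
rng = np.random.default_rng(0)
grid = attention_map(rng.standard_normal((5, 8)), rng.standard_normal((5, 8)))
print(grid.shape)  # (5, 5)
```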
However, as the cache grows, so does the attention map, which slows down computation. And if encoding a piece of content requires more tokens than the cache can hold, the model’s performance drops. To avoid this, researchers employ a “sliding cache” that pushes out the oldest tokens to make room for new ones. But as soon as that first token is evicted, the model’s performance often plummets, sharply reducing the quality of newly generated words.
The researchers found a way around this: if the first token stays in the sliding cache, the model maintains its performance even after the conversation outgrows the cache. This observation led them to a phenomenon they dubbed “attention sinks.”
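In code, the fix amounts to a different eviction rule: drop tokens from the middle of the sequence rather than the front. The sketch below is an illustration of that idea under assumed names and data layout, not the researchers’ implementation:

```python
def evict_with_sinks(entries: list, max_size: int, num_sinks: int = 1) -> list:
    """Trim a cache to `max_size` entries while keeping the attention sink(s).

    `entries` holds one cache entry per token, in arrival order. Rather than
    dropping the earliest entries (a plain sliding window), keep the first
    `num_sinks` entries plus the most recent ones.
    """
    if len(entries) <= max_size:
        return entries
    window = max_size - num_sinks          # room left for the newest tokens
    return entries[:num_sinks] + entries[len(entries) - window:]
```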
The Role of Attention Sinks
Some models use a Softmax operation in their attention mechanism, which assigns each token a score representing how strongly it relates to every other token. Since most tokens aren’t strongly related, their attention scores are very low, yet the scores must still add up to 1, so the model dumps any remaining attention score on the first token.
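The arithmetic behind this is easy to see: softmax always distributes a full unit of attention, so even when every relationship is weak, the weights have to land somewhere. A toy example with arbitrary values:

```python
import numpy as np

# Raw attention scores of one token against six cached tokens: all weak.
logits = np.array([-2.0, -2.1, -1.9, -2.2, -2.0, -2.1])
weights = np.exp(logits) / np.exp(logits).sum()

print(weights.round(3))  # a handful of small, similar weights
print(weights.sum())     # ~1.0 -- the full unit of attention must go somewhere
```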
“We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible — every other token can see it. We found that we must always keep the attention sink in the cache to maintain the model dynamics,” explains Song Han, an associate professor in Electrical Engineering and Computer Science (EECS) at MIT, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA.
In building StreamingLLM, the researchers discovered that keeping four attention sink tokens at the beginning of the sliding cache led to optimal performance. They also found that the positional encoding must be handled consistently as the cache turns over: each token is encoded by its position within the cache rather than its position in the original text, so the positions the model sees stay the same even as new tokens are added and others are bumped out.
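A minimal sketch of that positional bookkeeping, assuming the cache simply holds the sink entries followed by the most recent window (real models use rotary or learned positional encodings, which are more involved):

```python
def cache_positions(cache_entries: list) -> list[int]:
    """Assign each cached token a position based on its slot in the cache.

    Even if the cached tokens originally sat at, say, text positions
    [0, 1, 2, 3, 2050, 2051, 2052], the model is given the contiguous
    positions [0, 1, 2, 3, 4, 5, 6], so the encodings it sees stay
    stable as old tokens are bumped out.
    """
    return list(range(len(cache_entries)))
```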
The result? StreamingLLM can maintain a continuous conversation while outperforming a popular method that uses recomputation.
For instance, when the cache has 256 tokens, the recomputation method takes 63 milliseconds to decode a new token, while StreamingLLM takes 31 milliseconds. However, if the cache size grows to 4,096 tokens, recomputation requires 1,411 milliseconds for a new token, while StreamingLLM needs just 65 milliseconds.
What Does This Mean for AI?
StreamingLLM’s ability to process texts up to 4 million tokens in length is more than an impressive benchmark: it opens the door to a new class of AI-driven generation applications.
The new method enables AI chatbots to conduct long conversations throughout the workday without needing to be continually rebooted. This is particularly useful in tasks like copywriting, editing, or generating code.
The researchers are also exploring the use of attention sinks during model training. They found that training with attention sinks allowed a model to maintain performance with only one attention sink in its cache, rather than the four that are usually required to stabilize a pretrained model’s performance.
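One way to picture that training-time idea is to reserve a dedicated placeholder token and prepend it to every training sequence; the token ID and helper below are hypothetical, not the researchers’ code:

```python
SINK_TOKEN_ID = 0  # hypothetical ID reserved for a dedicated sink token

def prepend_sink(token_ids: list[int]) -> list[int]:
    """Prepend the placeholder so the model learns to treat it, rather than
    whatever token happens to come first, as its attention sink."""
    return [SINK_TOKEN_ID] + token_ids
```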
Despite these advancements, there are still limitations to be addressed. For instance, the model cannot remember words that aren’t stored in the cache. Future research will focus on methods to retrieve tokens that have been evicted or enable the model to memorize previous conversations.
This novel approach to AI chatbot conversations has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. It’s a significant step forward in the development of long-lasting, efficient AI chatbots.