Revolutionizing AI: Microsoft's LongNet and the Future Ahead
The Future of ChatGPT with LongNet
Imagine a chatbot capable of processing the entire Internet in one go. This is the vision behind Microsoft's latest framework, LongNet, which promises to handle prompts of up to 1 billion tokens (roughly a human's lifetime of reading) in just half a second.
In stark contrast, Claude, the chatbot with the largest context window available today, manages only 100,000 tokens, roughly 75,000 words, about the length of a full Harry Potter novel. While impressive, that number pales in comparison to LongNet's capabilities.
With LongNet, we are on the brink of a new frontier for Generative AI models, potentially enabling them to absorb vast amounts of information instantaneously. This significant leap may bring us closer to achieving Artificial General Intelligence (AGI) and eventually superintelligence. But how has Microsoft made such a groundbreaking advancement?
If you wish to stay informed about the rapidly evolving AI landscape and feel motivated to engage with it, consider subscribing to my free weekly AI newsletter.
The first video, "The Future of Work With AI - Microsoft March 2023 Event," explores how AI is set to transform the workplace.
Attention Mechanism: The Heart of AI
At the core of every AI chatbot lies a groundbreaking discovery: Transformers. This architecture has revolutionized Natural Language Processing (NLP) models, including ChatGPT, and is now making waves in Computer Vision (CV) too.
Vision Transformers employ a similar structure to process images, treating image patches as tokens, much like words in a text application. But what sets this architecture apart is the attention mechanism, which enables machines to grasp context in language.
Understanding Context with Attention
Simply put, attention allows words within a text sequence to interact, helping the model discern their relationships and meanings. Each word is converted into an embedding, which is then associated with query, key, and value vectors. The query vector interacts with the key vectors of prior tokens in the sequence, facilitating this 'conversation' among words, which is essential for understanding context.
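As a rough sketch, here is single-head self-attention in NumPy (real Transformers learn weight matrices that project each embedding into its query, key, and value vectors, and a causal mask restricts each token to the prior ones; both are omitted here for brevity):

```python
import numpy as np

def self_attention(Q, K, V):
    """Each query scores every key, the scores are softmax-normalized,
    and the values are blended according to those weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token interactions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # context-aware representations

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))     # 5 tokens, 8-dimensional embeddings
out = self_attention(x, x, x)   # self-attention: Q, K, V all come from x
print(out.shape)                # (5, 8)
```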
However, standard attention has a significant limitation: its computational cost grows quadratically with sequence length, O(n²), so doubling the text length quadruples the cost.
Why is this the case? In traditional Transformers, each word must engage with all preceding tokens, making the process highly resource-intensive and necessitating strict limits on input sizes.
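A quick back-of-the-envelope calculation shows why this blows up:

```python
# Every token scores every other token, so the attention matrix holds
# seq_len * seq_len entries: double the length, quadruple the work.
for seq_len in (1_000, 2_000, 4_000, 1_000_000_000):
    print(f"{seq_len:>13,} tokens -> {seq_len ** 2:>27,} pairwise scores")
```

At 1 billion tokens, dense attention would need on the order of 10¹⁸ pairwise scores, which is why a fundamentally different attention pattern is required.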
The Importance of Length
The more context provided, the better the response. Machines, like humans, benefit from extensive context to deliver accurate answers. A chatbot can summarize a book more effectively when given the full text instead of just a chapter.
Unfortunately, current models have restricted input capacities: ChatGPT (GPT-4) allows up to 32,000 tokens, while Claude permits up to 100,000. This limitation hampers their performance, especially in applications requiring rich context.
Nevertheless, one of the most compelling features of large language models (LLMs) is their ability to learn from limited data in real-time. By supplying relevant information in the prompt, users can guide the model to generate more accurate responses without relying solely on pre-existing knowledge.
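As a sketch of what that looks like in practice (the helper below is hypothetical; any chat model that accepts a plain-text prompt works the same way):

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages so the model answers from the supplied
    context rather than from its pre-trained knowledge alone."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What attention pattern does LongNet use?",
    ["LongNet replaces dense attention with dilated attention ..."],
)
```

The larger the context window, the more of this material fits into a single prompt, which is exactly what a billion-token window promises to unlock.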
Enter LongNet: A Breakthrough in Attention
LongNet introduces a new Transformer architecture built on dilated attention, which significantly reduces the computational cost of processing lengthy sequences. With this approach, the runtime remains under one second even for sequences of 1 billion tokens, roughly the equivalent of reading 10,000 Harry Potter novels in half a second!
But how does this work?
The second video, "What runs ChatGPT? Inside Microsoft's AI supercomputer," delves into the technology powering these advancements.
Sparsifying Vectors for Efficiency
LongNet is not the first attempt to mitigate these computational challenges. Sparse Transformers limited which words could interact in order to cut costs, but they often compromised model quality. In contrast, LongNet maintains or even improves perplexity (a measure of how well the model predicts text; lower is better) while reducing operational costs.
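For reference, perplexity is the exponential of the model's average negative log-likelihood over a text of N tokens, so lower values mean better predictions:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$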
This is achieved by splitting the sequence into segments and letting each token attend only to a dilated subset of the tokens in its segment. LongNet then mixes several segment lengths and dilation rates, so attention stays dense between nearby tokens and grows progressively sparser over long distances, striking a balance between computational efficiency and information retention.
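A rough NumPy sketch of that pattern (my illustrative reading of the paper's dilated attention, not Microsoft's implementation; the segment lengths and dilation rates below are made up):

```python
import numpy as np

def dilated_pattern(seq_len, segment_len, dilation):
    """Boolean mask where entry (i, j) is True if token i may attend to
    token j: the sequence is cut into segments, and within each segment
    only every `dilation`-th token takes part in attention."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for start in range(0, seq_len, segment_len):
        idx = np.arange(start, min(start + segment_len, seq_len))[::dilation]
        mask[np.ix_(idx, idx)] = True
    return mask

# Mixing several (segment length, dilation) pairs restores coverage:
# short, undilated segments keep local detail, while long, heavily
# dilated segments provide a coarse global view.
combined = np.zeros((16, 16), dtype=bool)
for w, r in [(4, 1), (8, 2), (16, 4)]:
    combined |= dilated_pattern(16, w, r)
print(f"dense pairs: {16 * 16}, pairs kept by dilated attention: {combined.sum()}")
```

Because the number of pairs each pattern keeps grows roughly linearly with sequence length, the overall cost stays near-linear rather than quadratic.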
A Step Toward AGI
For humanity to realize AGI, machines must be able to process sequences of effectively unlimited length. With LongNet's advancements, AI can efficiently manage vast inputs, focusing on the relevant information without getting bogged down by irrelevant details.
This architectural breakthrough paves the way for further innovations built on top of it, bringing us closer to a generational leap in AI technology.
Link to paper: LongNet: Scaling Transformers to 1,000,000,000 Tokens (arXiv:2307.02486)