Exploring the Potential of Convolutional Models in AI
Artificial Intelligence has been making headlines, especially with the rise of ChatGPT and the recent release of OpenAI's GPT-4. Companies like Google are gearing up to introduce their own chatbots, setting up a competition among large language models (LLMs). These chatbots are built on very large language models, and training them is costly, demanding significant infrastructure investments (GPUs or TPUs). This raises the question: is this really the most effective technology, or are there superior alternatives?
Until 2017, recurrent neural networks (RNNs) dominated Natural Language Processing (NLP). Variants such as long short-term memory (LSTM) and gated recurrent units (GRU) improved on the vanilla design, but the landscape remained RNN-centric.
RNNs, however, faced two significant challenges:
- Their sequential nature made parallelization difficult: each hidden state depends on the previous one, so the sequence must be processed step by step (see the sketch below).
- The vanishing and exploding gradient problem made long sequences hard to learn from, because gradients shrink or blow up as they are propagated back through many time steps.
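To make the sequential bottleneck concrete, here is a minimal, illustrative sketch of a vanilla RNN step loop in PyTorch (the sizes and weight initialization are arbitrary assumptions, not taken from any particular model):

```python
import torch

# Illustrative sizes; not from any specific model.
batch, seq_len, d_in, d_hidden = 8, 128, 64, 64
x = torch.randn(batch, seq_len, d_in)

W_xh = torch.randn(d_in, d_hidden) * 0.01      # input-to-hidden weights
W_hh = torch.randn(d_hidden, d_hidden) * 0.01  # hidden-to-hidden weights

h = torch.zeros(batch, d_hidden)
states = []
for t in range(seq_len):
    # Strictly sequential: step t cannot start before step t-1 has produced h.
    h = torch.tanh(x[:, t] @ W_xh + h @ W_hh)
    states.append(h)
hidden = torch.stack(states, dim=1)  # (batch, seq_len, d_hidden)
```

The for-loop is the parallelization barrier, and the repeated multiplication by W_hh during backpropagation is exactly where gradients vanish or explode.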
In 2014, a pivotal paper from Bengio's group introduced the concept of attention for machine translation: rather than compressing an entire sentence into a single fixed vector, the model learns to weigh the relevant preceding and succeeding source words when producing each output word.
Though the initial model still utilized RNNs, it laid the groundwork for a transformative shift in NLP, culminating in the 2017 publication of "Attention Is All You Need." In this groundbreaking work, Google unveiled the transformer architecture, which seamlessly integrated various NLP principles (such as attention, embedding, self-attention, and positional encoding) while eliminating the reliance on RNNs.
Why is the transformer considered revolutionary? Besides an initial embedding, it consists solely of multi-head self-attention blocks followed by a feed-forward layer (a minimal block is sketched after this list). The authors provided three primary motivations for this approach:
- The total computational complexity per layer.
- The degree of parallelizable computation, quantified by the minimum number of sequential operations required.
- The path length for long-range dependencies within the network.
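Here is a minimal sketch of a single transformer block in PyTorch; the layer sizes, normalization placement, and dropout values are illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder-style block: multi-head self-attention + feed-forward.
    Sizes are illustrative, not those of any published model."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: every token attends to every other token in parallel.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward applied independently to each token.
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 128, 512)      # (batch, sequence length, model dim)
y = TransformerBlock()(x)         # same shape out: (2, 128, 512)
```

Stacking blocks like this, on top of token embeddings and positional encodings, is essentially the whole architecture.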
This architecture removes the temporal constraints of RNNs, enabling straightforward parallel computation. However, parallelism alone does not explain why transformers outperform RNNs so decisively.
The true advantages stem from self-attention, allowing the model to:
- Grasp the relationships between distant elements within a sequence.
- Model diverse types of sequential data, extending its application to images, graphs, audio, and more.
Since 2017, transformers have replaced RNNs in application after application, achieving remarkable success with numerous adaptations across fields. A prevailing theme has emerged: scaling up parameters and data produces unexpected emergent behaviors, such as in-context learning, solidifying the transformer as the de facto standard.
However, the self-attention mechanism carries a considerable computational cost: as the sequence length grows, compute and memory grow quadratically, which is why many transformers cap their input at around 512 tokens. Longer sequences mean more context; models able to ingest entire documents could unlock performance and emergent behaviors we have not yet seen.
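To see where the quadratic term comes from, here is a small back-of-the-envelope sketch (sizes are illustrative): the attention score matrix holds one entry per pair of tokens, so it grows with the square of the sequence length.

```python
import torch

def attention_scores(q, k):
    # q, k: (batch, seq_len, d) -> score matrix of shape (batch, seq_len, seq_len)
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)

for seq_len in (512, 2048, 8192):
    q = k = torch.randn(1, seq_len, 64)
    scores = attention_scores(q, k)
    # The number of pairwise scores grows as seq_len ** 2:
    # 512 -> 262,144   2048 -> 4,194,304   8192 -> 67,108,864
    print(seq_len, scores.numel())
```

Quadrupling the context multiplies the score matrix, and the work needed to compute it, by sixteen.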
Efforts to mitigate computational costs have explored alternative self-attention methods, including linear and sparse approximations. While these models offered efficiency gains, they often suffered from reduced performance and expressiveness.
But what makes attention so effective, and why does it give rise to behaviors like in-context learning? To find a substitute for attention, one must answer these questions. Research by Anthropic showed that during training, certain "induction heads" emerge: they look back for earlier occurrences of the current pattern in the sequence and predict that the token which followed it before will follow it again.
These induction heads are circuits formed across multiple attention heads in a transformer that move information from one token to another, effectively copying relevant tokens from earlier in the input directly to the output, a natural fit for a model with billions of parameters and vast memory capacity.
So, can we achieve similar outcomes without incurring quadratic costs?
Recent explorations have sought to apply transformers to lengthy sequential data (like time series, music, and audio). This raises a question: how can models with extensive context lengths (32k or even 64k) be viable?
The answer lies in Flash Attention, an algorithm designed to enhance the attention mechanism's efficiency without relying on approximations. Essentially, it reorganizes operations for improved computational efficiency and reduced memory usage.
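The core trick is to stream over blocks of keys and values with an online softmax, so the full score matrix never has to be materialized. The following is a didactic sketch of that reorganization in plain PyTorch; it illustrates the idea, not the fused, IO-aware CUDA kernel that gives FlashAttention its real speed:

```python
import torch

def blockwise_attention(q, k, v, block=1024):
    """Streaming attention over key/value blocks with an online softmax.

    Only (seq_len x block) score tiles are ever held in memory, instead of
    the full (seq_len x seq_len) matrix. Didactic sketch only.
    """
    scale = q.shape[-1] ** -0.5
    L = k.shape[-2]
    m = torch.full(q.shape[:-1], float("-inf"))   # running row-wise max
    norm = torch.zeros(q.shape[:-1])              # running softmax normalizer
    acc = torch.zeros_like(q)                     # unnormalized output accumulator
    for start in range(0, L, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = (q @ kb.transpose(-2, -1)) * scale    # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)              # rescale previous accumulators
        p = torch.exp(s - m_new.unsqueeze(-1))
        norm = norm * alpha + p.sum(dim=-1)
        acc = acc * alpha.unsqueeze(-1) + p @ vb
        m = m_new
    return acc / norm.unsqueeze(-1)

# Matches the naive implementation that builds the full (L x L) matrix:
q, k, v = (torch.randn(1, 4096, 64) for _ in range(3))
ref = torch.softmax((q @ k.transpose(-2, -1)) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```

Recent PyTorch releases expose fused kernels of this kind through torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention implementation on supported GPUs.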
Previous work by Stanford researchers aimed to develop a non-quadratic alternative to attention but did not close the performance gap. It did, however, yield insights into what makes attention effective and produced tools for testing whether alternative mechanisms support in-context learning.
Recent collaborative research between the University of Montreal and Stanford introduces the Hyena operator as a potential attention-free alternative. Essentially, this approach returns to convolution. Where earlier work applied fixed convolutions over words, here the filters are parameterized implicitly, so their effective length can adapt rather than being hard-coded (building on ideas proposed earlier by a group in Amsterdam).
In this framework, the matrix A(x), the analogue of the attention matrix, is never built explicitly. Instead, it is defined implicitly by a sequence of long convolutions whose filters are generated by a linear operator (a feed-forward network).
Concretely, the input is first mapped through several linear projections; the filters, parameterized by a feed-forward network that is updated during training, are then convolved with these projections, and the branches are combined through element-wise multiplication (gating).
Interleaving convolutions and gating in this way lets the model relate different parts of the sequence, much as attention does: the operator retains much of attention's expressiveness, while the convolutions run in subquadratic time and the filters can be executed rapidly in parallel.
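Putting those pieces together, here is a heavily simplified, order-2 Hyena-style operator in PyTorch. It is a sketch under assumptions: the sinusoidal positional features, the layer sizes, and the names ImplicitFilter and HyenaLikeOperator are illustrative inventions, not the authors' reference implementation, but it shows the three ingredients described above: input projections, implicit long convolutions evaluated via FFT, and element-wise gating.

```python
import math
import torch
import torch.nn as nn

class ImplicitFilter(nn.Module):
    """Long convolution filter parameterized implicitly by a small FFN over
    positional features (a simplification of the implicit-filter idea)."""
    def __init__(self, d_model, seq_len, d_pos=16, d_hidden=64):
        super().__init__()
        t = torch.linspace(0, 1, seq_len).unsqueeze(-1)            # (L, 1) positions
        freqs = torch.arange(1, d_pos // 2 + 1).float() * math.pi
        self.register_buffer("pos", torch.cat([torch.sin(t * freqs),
                                               torch.cos(t * freqs)], dim=-1))
        self.ffn = nn.Sequential(nn.Linear(d_pos, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self):
        return self.ffn(self.pos)                                  # (L, d_model)

def fft_causal_conv(u, h):
    """Causal convolution of u (B, L, D) with filter h (L, D) via FFT, O(L log L)."""
    L = u.shape[1]
    u_f = torch.fft.rfft(u, n=2 * L, dim=1)
    h_f = torch.fft.rfft(h, n=2 * L, dim=0)
    return torch.fft.irfft(u_f * h_f, n=2 * L, dim=1)[:, :L]

class HyenaLikeOperator(nn.Module):
    """Order-2 Hyena-style block: projections, implicit long convolutions,
    and element-wise gating. A didactic sketch, not the reference code."""
    def __init__(self, d_model, seq_len):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 3 * d_model)   # value + two gating branches
        self.filters = nn.ModuleList([ImplicitFilter(d_model, seq_len) for _ in range(2)])
        self.proj_out = nn.Linear(d_model, d_model)

    def forward(self, u):                                # u: (batch, L, d_model)
        v, x1, x2 = self.proj_in(u).chunk(3, dim=-1)
        z = x1 * fft_causal_conv(v, self.filters[0]())   # gate * (filter conv value)
        z = x2 * fft_causal_conv(z, self.filters[1]())
        return self.proj_out(z)

y = HyenaLikeOperator(d_model=64, seq_len=1024)(torch.randn(2, 1024, 64))  # (2, 1024, 64)
```

Because the convolution is computed in the frequency domain, the cost per layer scales as L log L in the sequence length L rather than L squared.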
How fast is Hyena? It scales subquadratically with sequence length, whereas standard attention, even with Flash Attention, remains quadratic in compute. The differences are negligible for short sequences but become significant beyond a certain threshold: for sequences longer than roughly 6k tokens Hyena shows lower runtimes, and at 100k tokens it is up to 100 times faster.
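As a rough back-of-the-envelope comparison, the snippet below contrasts idealized operation counts for the attention score computation (L squared times d) against an FFT-based long convolution (L log L times d). These are theoretical ratios only; constant factors, memory traffic, and kernel maturity mean measured wall-clock speedups, such as the roughly 100x figure above, are far smaller than the raw operation-count ratios.

```python
import math

D = 64  # illustrative head/model dimension

def attention_ops(L, d=D):
    return L * L * d               # pairwise scores dominate: O(L^2 * d)

def fft_conv_ops(L, d=D):
    return L * math.log2(L) * d    # FFT-based long convolution: O(L log L * d)

for L in (1_000, 6_000, 100_000):
    print(f"L={L}: op ratio ~ {attention_ops(L) / fft_conv_ops(L):.0f}")
# Prints ratios of roughly 100, 478, and 6000 respectively.
```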
There is room for even faster models, as these convolution parameterizations scale remarkably well to long sequences. This sets the stage for rapid development of Hyena-based models.
Yet, does Hyena match the performance of transformers? The authors assessed their model against a dataset called The Pile, a high-quality text collection encompassing sources like PubMed, arXiv, GitHub, and Wikipedia, totaling 825 gigabytes.
Interestingly, Hyena reached quality comparable to GPT-style transformers while requiring roughly 20% less training compute.
Evaluating Hyena on a classic benchmark dataset, SuperGLUE, revealed competitive results, demonstrating that Hyena exhibits few-shot capabilities similar to those of standard transformers, with significant accuracy improvements on specific tasks.
Furthermore, just as transformers were adapted to images (vision transformers), Hyena can be tailored for similar applications and proves competitive with vision transformers.
If you're interested, the code can be accessed here:
Conclusions
While attention mechanisms yield impressive results, they come with substantial computational demands. The extraordinary performance of models like GPT-4 is contingent upon billions of parameters, necessitating significant GPU resources. Current chip limitations pose challenges for future models like GPT-5.
The authors propose a model based on convolution rather than attention, emphasizing its subquadratic nature. This could pave the way for more efficient models capable of accommodating vast contexts (with aspirations of reaching up to a million tokens).
This might just be the first of many models moving away from self-attention. The authors suggest that there remains ample space for convolution within the transformer-dominated landscape.
If you found this discussion intriguing:
Explore my other articles, subscribe for notifications on new publications, and connect with me on LinkedIn.
Additionally, check out my GitHub repository for a collection of resources related to machine learning, artificial intelligence, and more.