Exploring the Existence of Long-Context LLMs: A Critical Analysis

Long-context large language models (LLMs) are currently in the spotlight, but do they genuinely exist beyond the marketing claims of various companies?

As Alexander Smith aptly stated, "A man’s real possession is his memory. In nothing else is he rich, in nothing else is he poor." The sentiment fits the current moment for LLMs, which have entered an era of extended context. Competition among developers has escalated rapidly, with context windows growing from 4K tokens to 32K and, in the latest announcements, as much as 2M tokens.

The growing interest in models that handle extensive context lengths signifies their potential to perform intricate reasoning and process lengthy documents or collections of texts.

A pertinent question arises:

But do LLMs effectively utilize this long context?

Long-context models are typically assessed through three primary methods:

  • Evaluating the perplexity of the language model over lengthy documents.
  • Searching for a specific piece of information hidden in a vast amount of text (the "needle in a haystack" test).
  • Conducting question-answering or summarization tasks over extensive documents.
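
To make the first of these methods concrete, here is a minimal sketch of sliding-window perplexity over a long document, using the Hugging Face transformers API. The model name is a placeholder, and the window and stride values are illustrative, not the setup of any particular paper.

```python
# Sliding-window perplexity of a causal LM over a long document.
# "gpt2" is a placeholder; swap in the long-context model under test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def long_document_perplexity(text: str, window: int = 1024, stride: int = 512) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    nlls, counted = [], 0
    for start in range(0, ids.size(1), stride):
        chunk = ids[:, start : start + window]
        target = chunk.clone()
        # Tokens already scored in the previous window are masked out.
        overlap = window - stride if start > 0 else 0
        target[:, :overlap] = -100
        with torch.no_grad():
            loss = model(chunk, labels=target).loss
        # HF averages the loss over non-masked (shifted) label positions.
        n_scored = int((target[:, 1:] != -100).sum())
        nlls.append(loss * n_scored)
        counted += n_scored
        if start + window >= ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / counted).item()
```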

According to the first two evaluation methods, most models appear to leverage their context length efficiently. The third method, however, tells a less flattering story:

> The third method provides a more realistic measure, focusing on retrieving accurate information from lengthy inputs. In question-answering scenarios, LLMs might shortcut by referencing brief snippets to derive answers without engaging with the entire document.

Image illustrating LLM performance metrics

To address these concerns, researchers undertook a thorough evaluation of 13 different models using a novel approach.

> In this study, we propose employing in-context learning (ICL) for extreme-label classification tasks to assess the capabilities of long-context LLMs. ICL requires models to comprehend the entire input to grasp the label space, thus demanding a comprehensive understanding of the input for accurate predictions.

The researchers devised tasks that necessitate the model's analysis of entire documents.
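
The core idea is easy to picture: pack at least one labeled demonstration per class into the prompt, so the model cannot predict correctly without reading the whole input. Here is a minimal sketch; the prompt template and the example intents are illustrative, not the paper's exact format.

```python
from typing import List, Tuple

def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate labeled demonstrations, then the unlabeled query."""
    blocks = [f"sentence: {text}\nlabel: {label}" for text, label in demos]
    blocks.append(f"sentence: {query}\nlabel:")
    return "\n\n".join(blocks)

demos = [
    ("I lost my card yesterday.", "lost_or_stolen_card"),
    ("Why was I charged twice for the same purchase?", "duplicate_charge"),
    # ...one demonstration per intent; with 77 intents the prompt
    # already runs to thousands of tokens.
]

print(build_icl_prompt(demos, "My transfer still hasn't arrived."))
```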

Image showing dataset examples for LLM training

The authors compiled six datasets whose examples range from short to very long inputs, each example carrying a single gold label. The datasets span several domains, from banking intents to discourse markers. For instance, Banking77 comprises roughly 13,000 examples across 77 intents, while Discovery covers 174 discourse markers with about 10,000 examples per marker.
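
As a hedged illustration of how one "round" of demonstrations (a single example per intent) can be assembled, here is a sketch using the Hugging Face hub copy of Banking77; the dataset id and the text/label field names are assumptions about that hosted version.

```python
from datasets import load_dataset

# Assumed hub id and field names for the hosted Banking77 copy.
ds = load_dataset("PolyAI/banking77", split="train")
label_names = ds.features["label"].names  # the 77 intent names

seen, round_demos = set(), []
for ex in ds:
    intent = label_names[ex["label"]]
    if intent not in seen:
        seen.add(intent)
        round_demos.append((ex["text"], intent))
    if len(seen) == len(label_names):
        break

print(f"{len(round_demos)} demonstrations, one per intent")
```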

A notable aspect of this research is the inclusion of 13 different models, from smaller ones like Gemma and LLaMA (7B parameters) to closed-source models like Gemini and GPT-4, as well as non-transformer models such as RWKV and Mamba.

Image detailing model performance comparisons

The findings indicate that Transformer-based models generally outperform RNN-based models across all evaluated datasets. However, both types lag behind the more powerful API-based models, particularly GPT-4.

The study also shows that performance declines as task complexity increases, with open-source models consistently trailing closed-source ones, GPT-4 above all. Notably, every model struggled with the most challenging dataset, Discovery.

Image demonstrating the impact of input size on model performance

For some models, performance declines roughly linearly as the input grows; others initially benefit from having more examples in the demonstrations, though the benefit eventually saturates.
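
A sketch of that scaling experiment, reusing build_icl_prompt from above: accuracy is measured as the number of demonstration "rounds" (full passes over the label space) packed into the context grows. Here `query_model` is a stand-in for whatever completion API or local model is being evaluated, not a real library call.

```python
def accuracy_vs_rounds(round_demos, test_set, query_model, max_rounds=5):
    """Accuracy as a function of how many demonstration rounds fit in context."""
    results = {}
    for r in range(1, max_rounds + 1):
        demos = round_demos * r  # r full passes over the label space
        correct = 0
        for text, gold in test_set:
            pred = query_model(build_icl_prompt(demos, text)).strip()
            correct += int(pred == gold)
        results[r] = correct / len(test_set)
    return results
```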

Another intriguing finding is that the arrangement of examples in the prompt significantly impacts performance, even for GPT-4. Models such as Mistral-7B-v0.2-base and InternLM2-7B-base exhibit considerable drops, suggesting heightened sensitivity to how the demonstrations are distributed.
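
One way to probe this sensitivity, sketched under the same assumptions as above: evaluate the identical set of demonstrations twice, once grouped by label and once randomly shuffled, and compare accuracy under each ordering.

```python
import random

def grouped_order(demos):
    # All demonstrations of the same label sit next to each other.
    return sorted(demos, key=lambda d: d[1])

def shuffled_order(demos, seed=0):
    # The same demonstrations, randomly interleaved.
    out = list(demos)
    random.Random(seed).shuffle(out)
    return out
```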

Image showing the effects of prompt distribution on model outputs

In summary, this research examines the capabilities of large language models on long in-context learning tasks, especially extreme-label classification. The authors curated a benchmark, LongICLBench, consisting of long in-context learning tasks whose difficulty scales with context length.

The authors observed a decline in performance beyond 20K tokens, highlighting the inadequacy of previous evaluation methods.

> Our findings indicate that while LLMs demonstrate promising capabilities with inputs up to 20K tokens, their ability to process and understand longer sequences significantly diminishes.

Recently, Google announced with great enthusiasm that their latest model could handle up to 1M tokens, prompting discussions about the future of Retrieval-Augmented Generation (RAG):

> Gemini 1.5 Pro features a standard 128,000 token context window. However, starting today, a select group of developers and enterprise customers can experiment with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview.

These results raise questions about whether Gemini truly analyzes the full context length.

If these models can efficiently utilize 20K tokens, can they genuinely claim to possess extended context length? Additionally, the necessity for examples to be clustered suggests that their arrangement plays a crucial role. It’s possible that evaluation methods are insufficient, allowing models to find information by leveraging the first 20K tokens, particularly for tasks that don’t require detailed knowledge beyond that point. This also serves as a reminder that our understanding of in-context learning remains incomplete, with many aspects still shrouded in mystery.

What are your thoughts on this topic? Feel free to share your insights in the comments.

If you found this article engaging, consider exploring my other works or connecting with me on LinkedIn. You can also check out this repository for weekly updates on ML & AI news. I welcome collaborations and projects, so don’t hesitate to reach out. Subscribe for free to be notified of my new publications.

Here’s a link to my GitHub repository, where I compile code and various resources related to machine learning, artificial intelligence, and more.

GitHub — SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

You might also be interested in my recent articles:

  • Image Segmentation with Simple and Elegant Methods: Why complex deep learning models aren’t always necessary.
  • Think, Then Speak: Researchers Create an Inner Monologue for AI: Introducing QuietStar, a promising method for LLM reasoning.
  • Is the Great Consolidation Underway?: Analyzing Microsoft's acquisition of Inflection AI as a sign of consolidation.
  • Harnessing the Power of Colors in Python: Discovering the hidden information in color images.

References

Here is a list of the principal references I consulted while writing this article (only the first author of each paper is cited):

  1. Dasigi, 2021, A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
  2. Li, 2024, Long-context LLMs Struggle with Long In-context Learning
  3. Anil, 2022, Exploring Length Generalization in Large Language Models
  4. Chen, 2023, Extending Context Window of Large Language Models via Positional Interpolation
  5. Fu, 2024, Data Engineering for Scaling Language Models to 128K Context
