Maximum Effective Context Window

Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs

Norman Paulsen Denver, Colorado, USA norman.paulsen@gmail.com


Large language model (LLM) providers boast big numbers for maximum context window sizes. To test the real world use of context windows, we 1) define the concept of a maximum effective context window, 2) formulate a method for testing a context window's effectiveness over various sizes and problem types, and 3) create a standardized way to compare model efficacy at increasingly larger context window sizes to find the point of failure. We collected hundreds of thousands of data points across several models and found significant differences between the reported Maximum Context Window (MCW) size and the Maximum Effective Context Window (MECW) size. Our findings show that the MECW is not only drastically different from the MCW but also shifts based on the problem type. A few top-of-the-line models in our test group failed with as little as 100 tokens in context; most had severe degradation in accuracy by 1000 tokens in context. All models fell far short of their Maximum Context Window, in some cases by more than 99%. Our data reveal that the Maximum Effective Context Window shifts based on the type of problem provided, offering clear and actionable insights into how to improve model accuracy and decrease model hallucination rates.

CCS Concepts

· Computing methodologies → Artificial intelligence → Natural language processing → Information extraction

Keywords

Large Language Models, Context Window, Inference Tokens, Hallucination Rates, LLM Accuracy


Norman Paulsen. 2025. Context Is What You Need:

1 Introduction

The rise of large language models (LLMs) such as ChatGPT, Claude, Gemini, and LLaMA has reshaped the landscape of natural language processing (NLP), enabling increasingly sophisticated contextual understanding, summarization, coding, and dialog capabilities. Central to the application of these advancements is the concept of the context window (or max context length): the maximum number of input tokens (words, punctuation, and symbols) a model can consider at one time. The expansion of context windows from hundreds to tens of thousands and, more recently, to millions of tokens represents a technical triumph. Several methods have been employed to extend context windows (Pawar et al., 2024), yet an unresolved and often misunderstood question remains: how much of that context can truly be used effectively by the model?

While model specifications cite a maximum context window of 128k, 1 million or even as much as 10 million tokens (Meta 2025a), these numbers reflect architectural or implementation limits, not necessarily the model's practical capacity for handling or retaining that full input context. Empirical evidence increasingly suggests a divergence between the maximum context window (MCW) and the maximum effective context window (MECW), the point beyond which additional tokens no longer meaningfully contribute to model output quality. Understanding this discrepancy is vital for the efficient deployment of LLMs in domains that demand long-context comprehension, such as legal reasoning, scientific literature synthesis, financial document analysis, and multimodal temporal correlation in video or audio.

This paper explores the emerging distinction between the MCW and the MECW in contemporary transformer-based LLMs. We propose that while LLM architecture permits long-sequence processing, practical limitations constrain the usable span of context in real-world inference tasks. We define MECW as the longest span of token input, for a given problem type, beyond which additional tokens measurably degrade the model's output. This notion reframes the context window not as a flat maximum input capacity, but as a spectrum of values dependent on the task at hand.

The appeal of longer context windows is intuitive. An LLM capable of digesting entire books, codebases, or sessions without truncation seems closer to achieving general-purpose intelligence. In enterprise settings, longer contexts allow seamless retrieval-augmented generation (RAG), more nuanced chat histories, and document-centric agents capable of reasoning over sprawling datasets. Small context windows limit the practical uses of LLMs.

Yet anecdotal observations from practitioners often contradict this promise. Despite feeding entire books or lengthy transcripts into models with claimed million-token capacity, users frequently observe that LLMs fail to answer questions about information embedded in the input sequences. General observations suggest that performance degrades when prompts rely on large context, and that models exhibit increased hallucination rates as token counts rise.

However, there is a lack of empirical evidence to support what we have seen anecdotally in the field. In this paper we outline a testing methodology for real world applications of LLMs to find the maximum effective context window. Leveraging the proposed methodology, we collected hundreds of thousands of data points from the most prominent LLMs on the market. We aggregate that data to show MECW values across a series of different problem types and compare the MCW to the MECW.

This paper makes three key contributions:

● A Formal Definition of MECW: We offer a principled definition of the Maximum Effective Context Window, grounded in informational, theoretical and behavioral criteria. This includes defining effectiveness in terms of fluctuating measurable influence on model predictions, rather than static inclusion limits like that of MCW.

● Empirical Analysis Across Tasks and Models: We evaluate several state-of-the-art LLMs across a battery of tasks (finding a Needle-in-a-Haystack, finding Needles-in-a-Haystack, summarization, finding Needles-in-a-Haystack with sorting) using controlled token context intervals to chart the actual usable range of input. This includes both open-source models (e.g., Mistral, LLaMA, Deepseek) and proprietary APIs (e.g., GPT4o-mini, GPT-5, Claude 3, Gemini 2.5, Gemini 2.0).

● Recommendations for Design and Deployment: Based on our findings, we outline practical guidelines for model architects, prompt engineers, and application developers. These include strategies to optimize RAG pipelines, truncate or summarize distant context, and produce more realistic context window limit estimates based on MECW rather than MCW.


Understanding the gap between the maximum context window and the maximum effective context window is not just a technical nuance; it is fundamental to how we effectively use and leverage artificial intelligence in real world applications. Misinterpretation of context capacity can lead to inefficient system designs, overinvestment in retrieval techniques that yield diminishing returns, or misaligned user expectations. It can also skew benchmarking results, especially when models are assumed to have uniform memory over arbitrarily long sequences.

Moreover, as LLMs are integrated into systems that simulate long-term memory or perform multi-session reasoning, distinguishing between architectural input size limits and real functional capacity becomes crucial. In cognitive science terms, MECW may be more analogous to "working memory" than to "long-term memory", and recognizing that distinction can lead to more robust, interpretable, and grounded model responses.

2 Related Work


Prior research has shown that long context windows suffer in a few different ways. Existing evidence shows that models are sensitive to where data is placed in the context. In older models, providing the same data in different orders produces varying rates of success in retrieving the requested information. Successful retrieval drops from 76% to 66% when the key information moves from position 1 to position 2, and falls below general model performance (with no data in context) when the key information is moved to position 7. Patterns in attention have been shown to improve data retrieval but are outside the scope of this experiment (Liu et al., 2023; Hsieh et al., 2024).

Not only does key information placement impact performance, the size of that information matters. Prior research found that the relevant information token count compared to total context token count impacts the successful retrieval rate by up to 25% (Bianchi et al., 2025).

Notable prior research papers show that models handle context window lengths greater than their training-time sequence length poorly. When using context lengths greater than their training-time sequence length, a U-shaped performance curve emerges based on critical information placement. Additional research shows that models start to degrade at half their training length (An et al., 2024; Liu et al., 2023; Press et al., 2022).

Relatively new research posits that model performance on novel tasks, like math and logic problems, suffers as the number of steps needed to complete them grows. The addition of more steps degrades the model's accuracy. We look to show that it is not the number of steps but the token length that causes a breakdown in performance (Xu et al., 2025).

For the purposes of our experiments, we look to negate data positioning as a contributing factor to performance by randomizing the data for each question proposed to the model. This guarantees an even distribution of data placement throughout the context.

2.2 Settings Impacting Performance

Several studies have shown that model performance varies with a multitude of factors, including max allowed output tokens, temperature, top_p, and even the Python frameworks used (Hochlehnert et al., 2025; Zhao et al., 2025).

Higher temperatures, approaching 1, lead to increased model performance with a tradeoff in reproducibility. For the purposes of our experiments, we use the default temperature of 1. Higher top_p values also lead to improved model accuracy but without the detriment to stability. We also leave top_p constant at its default value to reduce variability across experiments (Hochlehnert et al., 2025).

Max token values have an outsized impact on long query performance. As models approach set max token limits, they begin to truncate responses and provide unfinished answers. Not only has prior research shown this but we saw similar results when we started testing while using token limits (Ding et al., 2025). As a result, we set all token limits to maximum values to allow models to use as many tokens as necessary to provide an accurate answer.

Reasoning and non-reasoning models work in distinctly different ways, leading to large performance gains from reasoning models. Many experiments have compared reasoning and non-reasoning models from the same provider to help benchmark performance differences between the two (Chen et al., 2024; Chen et al., 2025; Chua et al., 2025; Ding et al., 2025; Feng et al., 2025; Li et al., 2025b). Our research is less interested in the distinction between reasoning and non-reasoning models when it comes to the testing framework. Instead, we focused on the top performing models from various providers, which were mostly reasoning models.

Fine tuning models on specific tasks, like data extraction from large documents, increases performance for said tasks. Studies found a 10.5% improvement in data retrieval questions on long context windows by fine tuning models on synthetic large context window tasks (Xiong et al., 2024). We did not want to focus the LLM on our tasks and were more interested in model generalization across various tasks.

All settings and frameworks remain constant during our tests to remove as many outside variables as possible. We want to focus on the impact of input context token length as the only variable. To further reduce noise from outside variables, we reran all tests repeatedly until appropriate p-values were achieved.

2.3 Novel Question Performance

Standard model performance frameworks are not built to evaluate long context windows. Furthermore, existing evaluation frameworks, like AIME24, AIME25 and GPQA Diamond, all suffer from random seeding volatility, wide fluctuations in scores due to the small number of questions, and variability across different versions of the frameworks (Hochlehnert et al., 2025).

Small datasets, like those used in AIME24, can significantly misrepresent model performance when used in comparisons. The 30 record dataset used in AIME24 means one missed question impacts reported model performance by 3.3%. Many model providers now retest models multiple times on these same small datasets to provide a more accurate number; however, those results are then impacted by seed parameters (Hochlehnert et al., 2025).
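The 3.3% figure follows directly from the dataset size. The short sketch below works through that arithmetic; the function name and the 1,000-question comparison size are illustrative assumptions, not values from any benchmark.

```python
def per_question_impact(num_questions: int) -> float:
    """Percentage-point change in reported accuracy caused by one question."""
    return 100.0 / num_questions


# A 30-question set such as AIME24: one miss moves the score by about 3.3 points.
print(f"{per_question_impact(30):.1f}")    # 3.3

# Hypothetical 1,000-question set, shown only for contrast.
print(f"{per_question_impact(1000):.1f}")  # 0.1
```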

The seed parameter, if not explicitly set, is generated dynamically for each inference. This was shown to make model performance on the same dataset vary significantly more than the baseline. Coupled with small datasets, this can result in large fluctuations in standardized model performance frameworks.

For all of these reasons, we create a new testing framework for measuring the specific impact of input token length for a given task. None of the existing testing frameworks provide data in a format sufficient for testing incrementally increasing context lengths for real world use cases.

2.4 Other Frameworks

Other frameworks for testing long context windows have been developed in the last 12 months. Several have focused on the Needle-in-a-Haystack problem, demonstrating the effectiveness and limits of finding a single piece of information in a large context window (Gao et al., 2025; Ling et al., 2025; Nelson et al., 2024). Others focus on complex tasks on a fixed dataset (Bogomolov et al., 2024; Cui et al., 2025; Jacovi et al., 2025; Zhuang et al., 2025). None of these focus on incrementally testing model effectiveness on various tasks as token count increases.

ETHIC is designed to test long context tasks to see how well LLMs cover the provided material (Lee et al., 2024). This framework finds similar results but is focused on how to test a model effectively using its long context window, while we want to determine the point at which a context window breaks down for a given task.

The DocPuzzle Benchmark provides 100 multidomain cases with verification mechanisms (Zhuang et al., 2025). While this also focuses on long context data retrieval followed by complex reasoning tasks, it does not provide an incremental token count for the tasks.

CURIE, a scientific long-Context Understanding, Reasoning, and Information Extraction benchmark, also shows models underperform on long contexts (Cui et al., 2025). This benchmark focuses on scientific tasks with predetermined questions and answers which greatly differs from our approach of generating questions with variable context lengths.

The FACTS Grounding Leaderboard is an ongoing benchmark continuously testing model performance with documents up to 32k tokens in length (Jacovi et al., 2025). Similar to many other long form benchmarks, it only tests accuracy on a fixed set of data with predetermined questions.

Long Code Arena focuses on testing long context windows in a domain specific benchmark for LLM coding (Bogomolov et al., 2024). The benchmark focuses on 6 different aspects of code processing: generation, repair, completion, summarization, processing diffs. This differs from our research which looks at generalized model performance.

The LaRA Benchmark also tests large context windows with a focus on RAG vs long-context windows and finds inconclusive results (Li et al., 2025a). The tests found many factors are at play including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. We narrow our focus to context length by task type to determine the relationship.

U-NIAH, Unified Needle-in-a-Haystack, focuses on comparing LLM long contexts to RAG results to find tradeoffs between the two (Gao et al., 2025). The focus on a single problem type, Needle-in-a-Haystack, differs from our framework's focus on context length per question type.

The HELMET Benchmark tests models with various tasks and context sizes, like our approach (Yen et al., 2024). It also finds that models degrade on larger contexts, however, they do not focus on finding the point where models degrade for a given task. Instead, they bucket context windows into one of 5 buckets ranging from 8k to 128k tokens.

Models can be extended to effectively find facts in contexts of extraordinary size. The BABILong Benchmark tests model retrieval capacity exceeding 11 million tokens, the equivalent of 16,800 pages or 85 books (Kuratov et al., 2024). This framework focuses on breaking apart long contexts into smaller chunks referenced via recurrent memory. While this is impressive, we are interested in the optimal size of the chunks to pass to an LLM.

The LongReason framework provides a set of questions and artificially adds context to the material containing the answer to test models at various context sizes (Ling et al., 2025). Per their research, some key limitations are the fixed questions and the low complexity of the task, which consists mostly of Needle-in-a-Haystack type problems. Our framework expands on their research by increasing question variability, complexity and specificity of token length.

The NoLiMa Benchmark also tests increasing context lengths using Needle-in-a-Haystack style questions but with a twist. The information requires inference, meaning there is one additional reasoning step required to find the information (Modarressi et al., 2025). They also found similar results to us, that longer contexts degrade performance, but they do not test and compare different problem types across the same model.

The FLenQA data set most closely mirrors our own but with a few distinctions. FLenQA also focuses on showing model performance degradation over increasingly large context windows (Levy et al., 2024). However, they focus on a single type of true/false reasoning question and fill the context with autogenerated text. We diversified the question types and provided only data that could be relevant to the answer, similar to a real-world RAG implementation. We also build upon their findings to show model degradation on context window size is task specific, providing guidance in real world applications.

Many frameworks prior to 2024, such as BAMBOO, L-Eval, LongBench, and MuLD, also showed similar limitations with large context windows, focusing on a novel set of data and a novel set of questions (Dong et al., 2023; An et al., 2023; Bai et al., 2023; Hudson et al., 2022). Our focus is building on this great corpus of research by providing practical, real world guidance.

3 Methodology

To answer our research questions (how does MECW compare to MCW across models and task complexities, and can this be leveraged to improve model performance), we produced a series of novel questions for LLMs to answer.

This involved creating redundant questions with randomized data, asking these questions of various models repeatedly, slowly increasing the data set size in each round of questions and recording the answers.

3.1 Model Selection

We wanted a wide selection of frontier models from several different providers with open source and proprietary weights. Because of this, we primarily went with reasoning models and excluded small and mid-sized models. Our selection criteria resulted in the following 11 models:

● Open weight: Deepseek.r1-v1:0 (DeepSeek-AI et al., 2025), Meta.llama3-3-70b-instruct-v1:0 (Meta 2025b)

● Closed weight: claude-3-5-sonnet-20241022 (Anthropic 2024), gemini-2.0-flash (Gemini 2025a), gemini-2.5-flash-preview-05-20 (Gemini 2025b), GPT-4.1 (OpenAI 2025a), GPT o4-mini (OpenAI 2025b), GPT-5 (OpenAI 2025c), Grok-3-latest (xAI 2025), mistral-medium-2505 (Mistral AI 2025), Qwen-plus (Qwen Team 2025)

3.1 Framework Design

To collect the necessary data, we developed the following framework to test model performance over an increasing number of input token lengths.

3.1.1 Dataset: We defined our own dataset of 10,000 unique names of individuals. Each individual in the dataset was provided a random number, 1-20, of a random item from a list of 15 possibilities. Each item for each person was then assigned a random color out of 9 possibilities. Example data row:
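A minimal sketch of how such rows could be generated is shown below. The name, item, and color lists are short hypothetical stand-ins (the paper uses 10,000 names, 15 items, and 9 colors but does not enumerate them), and the record layout is an assumption made for illustration.

```python
import random

# Hypothetical stand-ins for the paper's 10,000 names, 15 items and 9 colors.
NAMES = ["Abigail Holmes", "Marcus Lee", "Priya Nair", "Tom Okafor"]
ITEMS = ["balloons", "marbles", "pencils"]
COLORS = ["red", "blue", "green"]


def make_record(name, rng):
    """Assign one person a random count (1-20), color and item."""
    return {
        "name": name,
        "count": rng.randint(1, 20),  # inclusive on both ends
        "color": rng.choice(COLORS),
        "item": rng.choice(ITEMS),
    }


def render(record):
    """Format a record as a sentence in the style of the example data row."""
    return f"{record['name']} has {record['count']} {record['color']} {record['item']}."


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [make_record(name, rng) for name in NAMES]
for record in records:
    print(render(record))
```

The sketch always uses the plural item form, so a count of 1 would read slightly oddly; how the original dataset handles that case is not stated.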

Abigail Holmes has 19 red balloons.

3.1.2 Question types: We defined four distinct questions based on this dataset: 1) Needle-in-a-Haystack, a search for a single data point from the data set; 2) Needles-in-a-Haystack, a search for multiple data points from the data set followed by a sum of their totals; 3) Summarization, a full sum of all data points in the data set; 4) Find and sort, a search for multiple data points from the dataset, which are then sorted alphabetically by name.

The Needle-in-a-Haystack is the simplest question on our list and, unsurprisingly, the models handled this one the best. For this question, we simply asked for the number of objects a person in the context data had. While all models performed the best at this question type, none managed to effectively find objects up to their MCW.

Figure 1: Needle-in-a-Haystack

In the Needles-in-a-Haystack question, we asked the model to find all instances of an object type or color (randomly selected) and sum up the total. Here we saw a large divergence in model performance between the top performers and lowest performers. The best performers showed reasoning steps that included filtering to the needed information and then summing that shorter list.

For summarization, we simply asked the model to sum up all object totals. All models performed worse on this task than on the Needles-in-a-Haystack task, which was unexpected. We assumed it would be harder for a model to perform the multi-step problem (filter and sum) than the single-step one (sum). This further supports the view that large context windows are ineffective: the filtered list allowed for a smaller context window on the final step, the sum.

The filter and sort question is the most complex one, requiring a few steps to complete. We ask the model to find the objects of a random type or color, then sort the object counts by owner name, then concatenate the values together in that order.
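To make the four question types concrete, the sketch below computes a ground-truth answer for each from a tiny example dataset. The record layout, function names, and the comma separator used when concatenating sorted counts are assumptions for illustration; the paper does not specify them.

```python
# Tiny illustrative dataset; only the first row comes from the paper's example.
records = [
    {"name": "Abigail Holmes", "count": 19, "color": "red", "item": "balloons"},
    {"name": "Zane Ortiz", "count": 4, "color": "red", "item": "marbles"},
    {"name": "Marcus Lee", "count": 7, "color": "blue", "item": "balloons"},
]


def needle(rows, name):
    """Needle-in-a-Haystack: the count held by one named person."""
    return next(r["count"] for r in rows if r["name"] == name)


def needles_sum(rows, key, value):
    """Needles-in-a-Haystack: filter by item type or color, then sum."""
    return sum(r["count"] for r in rows if r[key] == value)


def summarize(rows):
    """Summarization: sum every count in the context."""
    return sum(r["count"] for r in rows)


def find_and_sort(rows, key, value):
    """Find and sort: filter, sort by owner name, concatenate the counts."""
    matches = sorted((r for r in rows if r[key] == value), key=lambda r: r["name"])
    return ",".join(str(r["count"]) for r in matches)  # separator is an assumption


print(needle(records, "Abigail Holmes"))        # 19
print(needles_sum(records, "color", "red"))     # 23
print(summarize(records))                       # 30
print(find_and_sort(records, "color", "red"))   # 19,4
```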

3.2 Study Setup

To collect our answers from each model, we connected via APIs to every model using Python. We stored our initializing dataset and model responses in a Postgres database.

The Python code iterates through a pre-selected range of data points. For each value in that range, we concatenate that many data points from our data set, formulate a question based on that dataset and then randomize the order of the dataset. The model instructions were simply to answer the question based on the provided data in a specified JSON format, for ease of retrieval.

The resulting sample dataset and question were then fed into each model selected for a given range of data points. The resulting answer from each model was then captured and compared to the correct answer.
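The flow described above can be sketched roughly as follows. The prompt wording, the `{"answer": ...}` JSON shape, and the `fake_model` stand-in are all assumptions; the paper does not publish its prompts or client code, and a real run would replace `fake_model` with a provider API call.

```python
import json
import random


def build_prompt(rows, question, rng):
    """Shuffle the data rows and wrap them with instructions and the question."""
    shuffled = rows[:]
    rng.shuffle(shuffled)  # randomize placement of the relevant rows
    instructions = (
        "Answer the question using only the data provided. "
        'Respond as JSON: {"answer": <value>}.'
    )
    return "\n".join([instructions, "Data:", *shuffled, f"Question: {question}"])


def is_correct(raw_response, expected):
    """Parse the JSON reply and compare it with the known answer."""
    try:
        return json.loads(raw_response).get("answer") == expected
    except (json.JSONDecodeError, AttributeError):
        return False  # malformed or non-object replies count as wrong


def run_trial(rows, question, expected, ask_model, rng):
    """One question/answer round against any callable that maps prompt -> text."""
    return is_correct(ask_model(build_prompt(rows, question, rng)), expected)


def fake_model(prompt):
    """Hypothetical stand-in for a real model API call."""
    return json.dumps({"answer": 19})


rows = ["Abigail Holmes has 19 red balloons.", "Marcus Lee has 7 blue balloons."]
question = "How many balloons does Abigail Holmes have?"
print(run_trial(rows, question, 19, fake_model, random.Random(0)))  # True
```

Passing the model as a callable keeps the loop identical across providers, which matches the paper's goal of holding everything but input length constant.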

3.3 Analysis Procedure

We collected over 66k rows of data, capturing the model name, question type, input token count, and whether the correct answer was achieved. For each question and model combination, we validated that we had captured enough data by measuring the p-value of the relationship between input token count and correct answer (1 for true, 0 for false). Because of the p-values needed for further validation steps (validation of each graphical data point for each bucket/model/accuracy combination), the p-values found at this step were always extremely low, <1.0e-172 (Figures 5-8).
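The paper does not name the statistical test used. As one possible illustration only, the sketch below estimates a p-value for the relationship between token count and the 0/1 correctness flag using a Pearson (point-biserial) correlation with a permutation test. This is a swapped-in technique, not the author's method, and its smallest reportable p-value is 1/(n_perm + 1), so it cannot reproduce values as small as those quoted above.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def permutation_p_value(tokens, correct, n_perm=2000, seed=0):
    """Share of label shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(tokens, correct))
    shuffled = list(correct)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(tokens, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Synthetic example: correct below 1,100 tokens, wrong from 1,100 onward.
tokens = list(range(100, 2100, 100))
correct = [1] * 10 + [0] * 10
print(pearson(tokens, correct) < 0)                # True: accuracy falls as tokens rise
print(permutation_p_value(tokens, correct) < 0.01)
```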

Figure 6

To better tie token input count to correct answer rate, we bucketed input token counts into ranges and averaged the correct answers over the range for each model. For the needle in the haystack question, we used buckets of 5000 tokens because of the large level of accuracy across all models for this question type. For the remaining question types, we used buckets of 100 tokens.

To clean the data for bucketing, we removed datapoints that fell into a bucket with only 1 or 2 datapoints. This usually occurred on the high end of the token counts, where most tests fell in the preceding buckets but one rolled into a new bucket. To prevent skewed results (dramatic swings to 0 or 1), we removed these values to eliminate that bucket from the results.
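The bucketing and cleaning steps can be sketched as below. The floor-based bucket boundaries and the function name are assumptions; the minimum of three datapoints per bucket follows from the rule of dropping buckets holding only 1 or 2.

```python
from collections import defaultdict


def bucket_accuracy(results, bucket_size, min_points=3):
    """Average correctness per token bucket, dropping sparse buckets.

    results: iterable of (input_token_count, correct) with correct in {0, 1}.
    Returns {bucket_start: accuracy}, keyed by each bucket's lower bound.
    """
    buckets = defaultdict(list)
    for tokens, correct in results:
        buckets[(tokens // bucket_size) * bucket_size].append(correct)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
        if len(values) >= min_points  # buckets with 1 or 2 points are removed
    }


# Synthetic results: the lone point at 230 tokens falls in a bucket of its own.
results = [(20, 1), (50, 1), (90, 1), (110, 1), (150, 0), (180, 1), (190, 0), (230, 0)]
print(bucket_accuracy(results, bucket_size=100))  # {0: 1.0, 100: 0.5}
```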

With buckets we then retested p-values for statistical significance. See appendix A.3 for p-values for bucketed data.

4 Findings for Q1: Does MECW differ from MCW

Using buckets, clear data patterns emerged. Low levels of token counts improved upon published model hallucination rates with high confidence levels (Hughes 2023). As token count increased, all models' accuracy diverged from their published hallucination rates, providing increasingly erratic results. Model performance, in most cases, could be consistently forced to near 0% accuracy levels if provided too large of a context. These findings indicate that there is a need for a MECW measure across models.
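One way to turn bucketed accuracy into a single MECW figure is sketched below. The 0.9 accuracy threshold, the "first bucket that drops below the threshold" rule, and all numbers in the example (including the 128,000-token MCW) are assumptions for illustration, not the paper's procedure or results.

```python
def estimate_mecw(accuracy_by_bucket, bucket_size, min_accuracy=0.9):
    """Upper token bound of the last bucket before accuracy first drops too low."""
    mecw = 0
    for start in sorted(accuracy_by_bucket):
        if accuracy_by_bucket[start] < min_accuracy:
            break  # later recoveries are ignored: the window is already unreliable
        mecw = start + bucket_size
    return mecw


def shortfall_pct(mecw, mcw):
    """How far the effective window falls short of the advertised one, in percent."""
    return 100.0 * (1 - mecw / mcw)


# Hypothetical per-bucket accuracies for one model and one question type.
accuracy = {0: 1.0, 100: 0.98, 200: 0.95, 300: 0.60, 400: 0.97}
mecw = estimate_mecw(accuracy, bucket_size=100)
print(mecw)                                   # 300
print(f"{shortfall_pct(mecw, 128_000):.2f}")  # 99.77
```

Stopping at the first failing bucket is a conservative choice: it treats erratic accuracy beyond that point as unusable even if a later bucket happens to score well.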

5 Findings for Q2: Do different types of questions change the MECW

Our data provided surprising and clear results in this area. As expected, model performance varies widely with the type of question asked. However, we expected model rankings across tasks to remain relatively stable, and this was not the case. Some models handled the Needle-in-a-Haystack question far better than their peers but underperformed them on other question types. This provides an avenue for further research on model performance across task types: coding, scientific research, general Q/A, mathematics, etc.

6 Additional Findings

6.1 Model Accuracy Using RAG

All models outperformed their standard hallucination rates for our questions, under a certain context size. As context size increased, hallucination rates exceeded base hallucination rates for all models. For the worst performing models, hallucination rates reached near 100%. The same would likely happen to all models, if we continued to test at larger context windows.

Since our line of questioning provided both the facts and the question, it was a simple form of RAG, suggesting, like other research, that RAG increases model accuracy (Li et al., 2025a). Our research expands on this to show accuracy using RAG can reach near 100% levels, if utilized under the MECW. It also shows RAG can worsen model performance when exceeding the MECW.

6.2 Model Selection

Existing production agentic frameworks tend to utilize the best model or multiple models to guarantee accuracy of the results. This comes at a detriment to both cost and speed for responses. Understanding the specific use case and MECW for that use case across models allows for a better weighing of cost and speed when making a model selection. While OpenAI o4-mini performed at the top in the needles problem, if we are only utilizing 500 input tokens or fewer, we could use DeepSeek r1 at a fraction of the price with no reduction in accuracy.
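The selection logic described here can be sketched as choosing the cheapest model whose MECW for the task covers the expected input size. The model names, prices, and MECW values below are entirely hypothetical placeholders, not measurements from this study.

```python
def cheapest_adequate_model(models, task, input_tokens):
    """Cheapest model whose MECW for `task` is at least `input_tokens`, else None."""
    candidates = [m for m in models if m["mecw"].get(task, 0) >= input_tokens]
    return min(candidates, key=lambda m: m["cost_per_mtok"], default=None)


# Hypothetical catalog: names, prices and MECW values are placeholders.
models = [
    {"name": "model-a", "cost_per_mtok": 10.0, "mecw": {"needles": 2000}},
    {"name": "model-b", "cost_per_mtok": 1.0, "mecw": {"needles": 500}},
]

print(cheapest_adequate_model(models, "needles", 400)["name"])   # model-b
print(cheapest_adequate_model(models, "needles", 1500)["name"])  # model-a
print(cheapest_adequate_model(models, "needles", 5000))          # None
```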

The MECW is designed as an effective way to increase LLM accuracy by measuring, understanding and working within the limits of a given model and problem. This is especially useful for agents in an agentic framework. Each agent is designed with a specific task in mind, and the MECW can improve each agent's performance to near flawless levels. This is without any further modifications, like temperature, top_p, reverse RAG, or mixture of models, to further improve accuracy.

Model rankings also changed across tasks. OpenAI's o4-mini was a top performer in the Needles in a Haystack problem but one of the worst at the Needle in a Haystack problem. This further reinforces the need for an MECW measurement to help select the correct model for the correct task.

6 Discussion

6.1 Implications for GenAI Use

More important than temperature, top_p, seed parameters and other settings, context window size is the most important factor for determining model accuracy. While these other factors do help model accuracy, they only contribute to overall performance by a few percentage points at most (Hochlehnert et al., 2025; Zhao et al., 2025). Context window size can vary a model's performance from near 100% accuracy to near 0% accuracy.

Model context windows have grown to outsized amounts, as high as 1 million and 10 million tokens. These published limits lead to a false promise of model performance up to that amount. Existing platforms have changed architectures to support these large context windows with the idea that their output performance would improve. Our research shows that real world use cases for LLMs should focus on limiting token count in tasks for best results.

6.2 Need for New Testing Frameworks

Existing testing frameworks, like AIME24, AIME25 and GPQA, provide limited value on model performance in real world use cases. Furthermore, they produce wide swings in measured accuracy because of their small sample sizes.

Most applications of Generative AI do not use an LLM alone and leverage some kind of context extension, like RAG. This means we need better testing frameworks for showcasing model performance with more complex use cases. This includes novel question approaches, like those performed by Apple, and context stuffing like what was performed here (Shojaee et al., 2025).

Beyond static testing frameworks, we need a testing framework designed for testing models' MECW across various tasks that can be used by AI developers for their own tasks. Understanding when and where model performance breaks down will help developers understand the limits of a given model in a given context.

6.3 Impact on RAG Systems

Our data does support the notion that RAG systems reduce hallucination rates. As an example, GPT-5 did not hallucinate once in our data set when asked a question with under 500 tokens. The problem is that the hallucination rate rises as the input token count increases. At as few as 2000 input tokens, some models' hallucination rates go as high as 99%.

Because of the drastic decline in model performance when using larger context windows, RAG systems leveraging higher token counts degrade model performance instead of improving it.

Overall, this leads to cascading failure rates when LLMs are chained together, as in agentic frameworks (Meimandi et al., 2025; Xu et al., 2025). The idea of a near limitless context window leads developers to believe that an agentic system chaining multiple agents with large context windows will perform reasonably well under most situations. As shown through our research, large context windows degrade model performance, so agentic systems relying on large context windows for purposes of RAG will see cascading failures.
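A simple way to see why failures cascade is to assume each agent in a chain succeeds independently with the same per-step accuracy, so the chain succeeds only if every step does. The independence assumption and the accuracy values below are illustrative, not measurements from this study.

```python
def chain_success_rate(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_accuracy ** num_steps


# Five chained agents, each working inside its effective context window.
print(f"{chain_success_rate(0.99, 5):.3f}")  # 0.951

# The same chain when oversized contexts pull each step down to 80% accuracy.
print(f"{chain_success_rate(0.80, 5):.3f}")  # 0.328
```

Under this assumption a modest per-step drop compounds quickly, which is consistent with the cascading failures described above.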

More importantly, model accuracy can improve above standard hallucination rates simply by providing context windows at the correct size for the model and problem type. This shift in thought prevents cascading model failure by decreasing hallucination rates to a point where chaining agents together will not fail at massively increased rates. Our research concludes anyone leveraging large context windows and/or RAG systems should be aware of the kinds of questions that they are posing to their models and the limits of context windows around those questions to prevent or reduce hallucination rates and improve overall model accuracy.


Multivariate testing: Our study focuses on one variable, token count. Isolating this single variable did give us rich results on its impact on LLM performance. However, further testing could be done on token counts tied to other variables, like top_p, to see if another variable allows for larger MECWs.

Real world problems: Our questions and dataset are very simple. Real world problems might have more structured data input or attached documents, such as PDF or Excel files. Testing the effects of data format could lead to a better understanding of how to effectively use a model's context window. Prompt engineering techniques could also be tested to see if there is an uplift in model performance on larger context windows.

7 Conclusion

Our findings show that the Maximum Context Window (MCW) differs widely from the Maximum Effective Context Window (MECW) for all models tested. Additionally, the MECW changes with the type of problem presented to the model. While we did not test every model, we hypothesize that these statements hold true for all models currently on the market. Our results suggest that how effectively a model's context window is used is the largest contributing factor to the model's hallucination rate.

Acknowledgements

A special thank you to all of those who inspired this research, pushed me to continue it and provided support along the way.

References & Citations

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu. 2023. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. ArXiv:2307.11088

Anthropic. 2024. Claude 3.5 Sonnet Model Card Addendum. URL: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li. 2023. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. ArXiv:2308.14508

Owen Bianchi, Mathew J. Koretsky, Maya Willey, Chelsea X. Alvarado, Tanay Nayak, Adi Asija, Nicole Kuznetsov, Mike A. Nalls, Faraz Faghri, Daniel Khashabi. 2025. Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find. ArXiv:2505.18148

Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin. 2024. Long Code Arena: a Set of Benchmarks for Long-Context Code Models. ArXiv:2406.11612

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez. 2025. Reasoning Models Don't Always Say What They Think. ArXiv:2505.05410

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu. 2024. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. ArXiv:2412.21187

James Chua and Owain Evans. 2025. Are DeepSeek R1 And Other Reasoning Models More Faithful? ArXiv:2501.08156

Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian A Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman. 2025. CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning. ArXiv:2503.13517

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. ArXiv:2501.12948

Bowen Ding, Yuhan Chen, Futing Wang, Lingfeng Ming and Tao Lin. 2025. Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model. ArXiv:2506.23840

Google Gemini. 2025a. Gemini 2.0 Flash Model Card. URL: https://storage.googleapis.com/modelcards/documents/gemini-2-flash.pdf

Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu and Matthias Bethge. 2025. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. ArXiv:2504.07086

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister. 2024. Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. ArXiv:2406.16008

G Thomas Hudson, Noura Al Moubayed. 2022. MuLD: The Multitask Long Document Benchmark. ArXiv:2202.07362

Simon Hughes, Minseok Bae. 2023. Hughes Hallucination Evaluation Model (HHEM) Leaderboard. URL: https://huggingface.co/spaces/vectara/Hallucination-evaluationleaderboard

Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das. 2025. The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input. ArXiv:2501.03200

Gregory Kamradt. 2023. Needle-in-a-Haystack - pressure testing LLMs. Accessed: 2025-09-06. URL: https://github.com/gkamradt/LLMTest_NeedleInAHaystack

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev. 2024. In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss. ArXiv:2402.10790

Taewhoo Lee, Chanwoong Yoon, Kyochul Jang, Donghyeon Lee, Minju Song, Hyunjae Kim, Jaewoo Kang. 2025. ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage. ArXiv:2410.16848

Mosh Levy, Alon Jacoby, Yoav Goldberg. 2024. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models. ArXiv:2402.14848

Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang. 2023. LooGLE: Can Long-Context Language Models Understand Long Contexts? ArXiv:2311.04939

Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng. 2025a. LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing. ArXiv:2502.09977

Ming Li, Zhengyuan Yang, Xiyao Wang, Dianqi Li, Kevin Lin, Tianyi Zhou, Lijuan Wang. 2025b. What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding. ArXiv:2506.06998

Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen. 2025. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion. ArXiv:2501.15089

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. ArXiv:2307.03172

Kiana Jafari Meimandi, Gabriela Aránguiz-Dias, Grace Ra Kim, Lana Saadeddin, Mykel J. Kochenderfer. 2025. The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims. ArXiv:2506.02064

Meta. 2025a. Model Information. URL: https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md

Mistral AI. 2025. Medium is the new large. URL: https://mistral.ai/news/mistral-medium-3

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze. 2025. NoLiMa: Long-Context Evaluation Beyond Literal Matching. ArXiv:2502.05167

Elliot Nelson, Georgios Kollias, Payel Das, Subhajit Chaudhury, Soham Dan. 2024. Needle in the Haystack for Memory Based Large Language Models. ArXiv:2407.01437

OpenAI. 2025b. Introducing OpenAI o3 and o4-mini. URL: https://openai.com/index/introducing-o3-and-o4-mini/

Qwen Team. 2025. Qwen3: Think Deeper, Act Faster. URL: https://qwenlm.github.io/blog/qwen3/

Saurav Pawar, S.M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Aman Chadha, Amitava Das. 2024. The What, Why, and How of Context Length Extension Techniques in Large Language Models - A Detailed Survey. ArXiv:2401.07872

Ofir Press, Noah A. Smith, Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ArXiv:2108.12409

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. ArXiv:2506.06941

Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. 2024. From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. ArXiv:2406.19292

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Melroy Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou and Graham Neubig. 2025. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. ArXiv:2412.14161

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen. 2025. A Survey of Large Language Models. ArXiv:2303.18223

Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen. 2024. HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. ArXiv:2410.02694

Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang. 2025. DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities. ArXiv:2502.17807

A Appendix

A.1 Survey Questions

For the Needle-in-a-Haystack question, we ask the following:

How many objects does {person} have?

A.3 Graphical Data

How many {object} are there?

For the summary question, we simply ask:

For the sorted Needles-in-a-Haystack question, we ask a variant of a question depending on a randomly selected color or object type.

Find all people with {color} objects. Sort them by first and last name. Concatenate the number of objects they have into one long string value in the order they were sorted.

A.2 Definitions

Needle In a Haystack

Needles In a Haystack

Sort Question

Summary Question

A.4 P-Value Calculation

Charted P-Values for each bucket for each model for each problem set.
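This section does not restate which statistical test produced the values below, so the following is only a sketch of one common choice, a pooled two-proportion z-test comparing accuracy in one bucket against another. The function name and the example counts are hypothetical. The sketch also shows why a p-value can be reported as 0: values smaller than the smallest positive double-precision number (about 4.9e-324) underflow to zero, which is consistent with the 0 entries in the tables.

```python
import math


def two_proportion_p_value(
    correct_a: int, total_a: int, correct_b: int, total_b: int
) -> float:
    """Two-sided p-value for a difference in accuracy between two samples.

    Uses a pooled two-proportion z-test. This is an assumed test chosen for
    illustration, not necessarily the procedure used in the study.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    variance = pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    if variance == 0.0:
        # Both samples are all-correct or all-wrong: no evidence of a difference.
        return 1.0
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided normal tail: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    print(two_proportion_p_value(60, 100, 50, 100))        # moderate difference
    print(two_proportion_p_value(9900, 10000, 100, 10000))  # underflows to 0.0
```

With large samples and a large accuracy gap, the z statistic becomes so extreme that the tail probability cannot be represented in double precision, so the result prints as 0.0 even though the true value is merely extremely small.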

A.4.1 Needle in a Haystack Question

Needle Question	p-value
claude-3-5-sonnet-20241022	4.05e-244
gemini-2.0-flash	0
gemini-2.5-flash-preview-05-20	0
gpt-4.1	0
grok-3-latest	0
mistral-medium-2505	4.86e-298
o4-mini	1.61e-250
qwen-plus	1.98e-294
us.deepseek.r1-v1:0	5.32e-250
us.meta.llama3-3-70b-instruct-v1:0	0.00E+00

Needles Question	p-value
claude-3-5-sonnet-20241022	0
gemini-2.0-flash	0
gemini-2.5-flash-preview-05-20	0
gpt-4.1	0
gpt-5	0
grok-3-latest	0
mistral-medium-2505	0
o4-mini	0
qwen-plus	0
us.deepseek.r1-v1:0	0
us.meta.llama3-3-70b-instruct-v1:0	0

Summary Question	p-value
claude-3-5-sonnet-20241022	6.97e-183
gemini-2.0-flash	1.51e-183
gemini-2.5-flash-preview-05-20	0
gpt-4.1	4.44e-191
gpt-5	0
grok-3-latest	6.44e-194
mistral-medium-2505	8e-189
o4-mini	0
qwen-plus	1.51e-193
us.deepseek.r1-v1:0	0

Sorted Question	p-value
claude-3-5-sonnet-20241022	2.34e-172
gemini-2.0-flash	2.37e-176
gemini-2.5-flash-preview-05-20	0
gpt-4.1	2.28e-182
gpt-5	0
grok-3-latest	1.74e-182
mistral-medium-2505	4.4e-178
o4-mini	0
qwen-plus	2.59e-180
us.deepseek.r1-v1:0	0
us.meta.llama3-3-70b-instruct-v1:0	1.41e-193

	claude-3-5-sonnet-20241022	gemini-2.0-flash	gemini-2.5-flash-preview-05-20	gpt-4.1	grok-3-latest	mistral-medium-2505	o4-mini	qwen-plus	us.deepseek.r1-v1:0	us.meta.llama3-3-70b-instruct-v1:0
5000	1.01E-164	3.14E-184	9.35e-186	8.69E-185	8.96E-182	5.90E-188	4.51E-183	2.18E-188	2.55E-186	6.76E-187
10000	1.45E-122	4.24E-107	4.15e-104	8.06E-107	6.21E-112	1.25E-115	1.48E-107	7.22E-116	2.08E-103	1.10E-102
15000	9.83E-93	6.11E-128	5.55e-125	7.12E-126	4.21E-132	1.99E-124	3.89E-127	1.01E-124	1.50E-125	1.42E-125
20000	1.11E-111	7.24E-125	7.18e-127	8.92E-126	4.64E-123	6.57E-149	8.33E-126	5.98E-149	2.33E-126	7.66E-126
25000	3.54E-99	1.20E-192	2.08e-184	2.77E-189	3.90E-209	3.78E-145	1.07E-189	7.72E-146	3.82E-185	3.82E-184
30000	8.55E-97	1.46E-175	6.12e-151	6.78E-142	1.21E-101	1.84E-175	6.79E-142	3.09E-182	1.98E-150	4.64E-162
35000	8.76E-152	3.68E-149	8.49e-164	2.84E-169	8.17E-194	1.07E-195	6.22E-159	7.57E-152	1.68E-159	1.60E-172
40000	7.84E-79	1.47E-162	4.31e-178	1.27E-183	1.35E-267			5.38E-26		6.01E-206
45000	9.55E-40	5.71E-177	6.13e-194	4.15E-183	2.01E-169					1.25E-166
50000	5.31E-40	2.09E-211	1.02e-199	4.02E-195	1.89E-193					2.99E-163
55000	1.05E-92	8.98E-211	2.11e-257	3.56E-255	8.64E-229					8.14E-204
60000		2.45E-249	7.64e-238	5.29E-243	7.87E-279					2.99E-222
65000		0.00E+00	3.42e-303	0.00E-02	0.00E+00
70000		1.23E-41	0	2.79E-263	3.17E-269
75000			0
80000			1.52e-86
85000			2.91e-105
90000			2.57e-86

	claude-3-5-sonnet-20241022	gemini-2.0-flash	gemini-2.5-flash-preview-05-20	gpt-4.1	gpt-5	grok-3-latest	mistral-medium-2505	o4-mini	qwen-plus	us.deepseek.r1-v1:0	us.meta.llama3-3-70b-instruct-v1:0
100	2.65e-75	2.59e-136	1.07e-113	4.53e-109	7.64e-81	3.73e-136	9.31e-118	8.1e-131	9.29e-113	1.5e-129	8.53e-105
200	3.3e-137	1.07e-177	7.62e-150	2.15e-153	2.68e-105	2.55e-181	2.37e-163	2.25e-174	3.98e-165	5.22e-174	3.93e-173
300	2.54e-143	2.35e-194	1.58e-173	4.66e-172	4.12e-119	1.37e-202	1.52e-182	1.36e-193	4.03e-183	1.93e-191	5.55e-197
400	4.98e-156	2.7e-216	1.39e-188	6.51e-192	3.02e-136	1.07e-217	8.73e-207	5.31e-216	4.48e-202	1.58e-205	5.99e-217
500	7.09e-176	1.23e-203	8.32e-192	8.38e-199	2.55e-141	7.25e-215	1.32e-197	9.69e-210	4.32e-198	2.6e-216	4.03e-206
600	3.56e-166	4.58e-213	9.18e-191	3.97e-192	8.98e-148	8.6e-214	5.73e-196	8.16e-204	3.53e-199	7.57e-208	1.79e-212
700	2.76e-174	8.17e-198	8.19e-185	6.95e-192	4.83e-159	4.21e-200	1.87e-191	6e-211	1.74e-192	7.12e-205	1.06e-192
800	9.50E-173	4.07E-200	8.98E-189	4.66E-188	4.76E-159	2.77E-212	1.55E-189	2.10E-207	1.38E-191	6.51E-200	3.83E-209
900	1.24E-157	2.74E-119	1.10E-185	5.94E-132	5.95e-176	1.40E-87	4.03E-190	6.69E-216	4.63E-195	1.71E-216	7.06E-189
1000	9.19E-156		2.51E-217		4.17e-173		4.38E-88	1.67E-286	1.42E-97	2.16E-281	2.02E-37
1100	5.62E-176		2.22E-221		1.59e-162			4.35E-287		8.32E-289
1200	1.03E-146		1.68E-206		1.97e-182			3.24E-292		4.40E-283
1300	1.76E-173		2.73E-196		1.01e-193			8.20E-275		8.42E-271
1400	1.66E-171		7.71E-255		1.35e-190			0.00E+00		0.00E-02
1500	3.61E-186		1.43E-196		6.49e-206			2.84E-228		7.45E-257
1600	1.30E-178		6.71E-178		1.94e-210			8.51E-243		1.79E-238
1700	9.83E-198		8.56E-260		1.3e-199			6.54E-263		2.68E-254
1800	1.24E-163		1.14E-251		9.23e-217			1.44E-274		2.45E-260
1900	6.88E-226		2.36E-216		3.48e-224			1.08E-271		3.37E-223
2000	3.73E-191		1.74E-160		1.33e-242			3.36E-203		1.61E-161
2100	4.75E-126				1.65e-246
2200	2.99E-136				7.34e-221
2300	2.15E-193				6.87e-239
2400	4.02E-187				6.13e-207
2500	1.46E-222				1.58e-202
2600	3.09E-196				1.06e-206
2700	1.58E-190				1.14e-235
2800	8.00E-34				7.73e-211
2900					4.46e-218
3000					1.16e-226
3100					1.57e-284
3200					5.74e-229

claude-3-5-sonnet-20241022	gemini-2.0-flash	gemini-2.5-flash-preview-05-20	gpt-4.1	gpt-5	grok-3-latest	mistral-medium-2505	o4-mini	qwen-plus	us.deepseek.r1-v1:0	us.meta.llama3-3-70b-instruct-v1:0
1.19e-38	1.03e-50	4.92e-50	2.99e-50	9.51e-13	5.19e-53	7.28e-48	7.39e-50	1.43e-47	1.31e-50	1.52e-40

100	1.45E-72	2.53E-97	3.97E-95	1.59E-97	1.38E-91	2.85E-99	5.22E-91	1.54E-97	1.48E-90	6.32E-95	9.40E-96
200	6.20E-92	2.96E-122	2.89E-122	3.59E-124	1.53e-113	2.66E-129	1.59E-115	3.91e-125	2.74E-118	2.24E-122	8.52E-120
300	6.10E-100	6.91E-145	1.55E-142	1.54E-143	2.27e-127	6.50E-149	2.48E-131	1.04e-143	3.53E-131	1.78E-144	1.22E-139
400	8.38E-117	1.05E-136	1.59E-148	2.64E-147	2.96e-144	5.87E-138	2.04E-144	7.4e-152	1.61E-139	6.30E-152	1.22E-150
500	3.59E-124	6.05E-14	9.97E-153	6.19E-19	4.73e-152	1.69E-12	7.33E-58	2.18e-177	1.27E-69	6.52E-174	2.53E-62
600	2.19E-89	1.22E-17	1.38E-160	1.30E-14	4.21e-157	8.79E-19	8.32E-14	1.56e-186	1.67E-14	1.51E-183	1.73E-13
700	8.96E-12	1.26E-15	1.78E-86	1.00E-15	3.77e-170	1.17E-14	3.15E-15	5.45e-178	5.39E-15	2.73E-177	4.49E-16
800	3.66E-11	3.57E-15	9.49E-155	2.71E-16	3.91e-177	8.08E-17	5.13E-15	5.02e-193	1.32E-13	4.02E-198	4.25E-17
900	5.88E-14	9.68E-09	1.28E-143	8.59E-12	3.41e-186		6.83E-17	2.33e-177	5.40E-17	4.57E-175	8.49E-16
1000	1.09E-13		1.07E-134		3.03e-188			2.11e-159		1.36E-157
1100	1.34E-09		1.59E-90		1.22e-191			1e-154		1.79E-152
1200	4.09E-13				2.57e-204			7.58e-156		6.88E-149
1300					7.2e-194			1.39e-172		2.05E-164
1400					1.34e-191			1.2e-156		6.00E-172
1500					1.01e-193			2.83e-177		2.29E-159
1600					1.22e-207			4.26e-161		3.94E-167
1700					7.56e-209			1.69e-175		1.90E-181
1800					4.72e-206			3.46e-177		1.60E-169
1900					5.93e-206			3.92e-169		6.20E-182
2000					6.43e-202			1.82e-183		3.50E-178
2100					2.28e-198			2.71e-186		3.55E-171
2200					4.38e-215			7.35e-174		6.49E-188
2300					8.58e-202			1.5e-189		1.89E-159
2400					5.76e-227			7.44e-193
2500					4.46e-230			8.22e-188
2600					5.61e-226			5.31e-188
2700					6.11e-230			1.05e-193
2800					8.15e-201			2.71e-190
2900					4.43e-231			1.1e-193
3000					9.38e-208			1.06e-182
3100					5.29e-261			6.74e-222

3200	5.74E-284	3.02E-147
3300	2.24E-171	1e-134
3400		3.13e-38
3500		2.12e-39
3600		1.64e-45

Sorted Question	claude-3-5-sonnet-20241022	gemini-2.0-flash	gemini-2.5-flash-preview-05-20	gpt-4.1	gpt-5	grok-3-latest	mistral-medium-2505	o4-mini	qwen-plus	us.deepseek.r1-v1:0	us.meta.llama3-3-70b-instruct-v1:0
100	5.60E-52	3.49E-72	2.01E-70	6.31E-72	2.94e-61	3.94E-75	6.77E-67	7.22E-72	7.22E-66	1.41E-70	2.06E-68
200	1.97E-71	1.73E-91	3.17E-91	7.91E-93	4.85e-101	5.16E-94	1.10E-85	9.84E-93	3.99E-86	4.11E-92	2.68E-92
300	7.01E-79	2.88E-109	8.87E-108	5.38E-107	6.88e-120	6.22E-116	2.69E-104	6.66E-107	2.89E-103	1.23E-105	5.01E-105
400	1.65E-85	3.06E-115	5.04E-113	5.47E-118	9.61e-125	1.23E-118	2.97E-107	4.78E-117	8.85E-104	2.52E-115	6.10E-111
500	3.82E-90	1.18E-45	1.08E-129	9.22E-60	8.95e-140	3.38E-34	2.40E-88	3.20E-125	5.67E-94	1.06E-123	2.06E-95
600	2.54E-96		2.72E-125		1.7e-153			4.74E-132		5.63E-132
700	8.01E-15		1.21E-117		2.61e-151			2.23E-121		2.26E-121
800			2.86E-110		4.61e-149			1.92E-107		3.86E-109
900			4.54E-116		2.56e-166			5.20E-104		1.23E-106
1000			1.32E-105		1.96e-168			3.61E-111		4.89E-109
1100			1.71E-104		7.24e-175			1.83E-112		8.40E-112
1200			4.11E-114		8.57e-181			8.32E-109		1.30E-113
1300			1.67E-85		5.92e-207			5.41E-80		1.06E-84
1400			7.55E-59		2.71e-191			3.87E-54		5.70E-65
1500					2.44e-169
1600					7.79e-157
1700					1.05e-188
1800					5.17e-199
1900					4e-194
2000					4.11e-191
2100					3.57e-210
