The August 9, 2023 paper introduces the **Retentive Network (RetNet)**, a proposed foundational architecture for large language models intended to succeed the **Transformer**. RetNet aims to overcome the Transformer's inference inefficiencies by simultaneously achieving **training parallelism**, **low-cost inference**, and **strong performance**, a combination previously considered an "impossible triangle." The core of RetNet is the **retention mechanism**, which supports three computation paradigms—**parallel, recurrent, and chunkwise recurrent**—to enable efficient training and constant-time, O(1) inference, yielding significant reductions in GPU memory usage and latency and higher throughput than the Transformer. Experimental results across various model sizes and tasks demonstrate that RetNet is competitive in performance and offers superior efficiency in both training and deployment.
Source:
https://arxiv.org/pdf/2307.08621
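To make the retention mechanism described above concrete, here is a minimal NumPy sketch (not the paper's implementation) contrasting the parallel form used for training with the recurrent form that gives O(1)-per-token inference. It is a single head with a scalar decay `gamma` and omits the paper's normalization, gating, and chunkwise variant.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: O(N^2) like attention, but with causal decay instead of softmax."""
    N = Q.shape[0]
    n = np.arange(N)
    # Decay matrix D[n, m] = gamma^(n-m) for m <= n, zero above the diagonal (causal).
    D = np.tril(gamma ** (n[:, None] - n[None, :]))
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: one fixed-size state S, constant cost per generated token."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outputs = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # state update
        outputs.append(q @ S)            # readout for this position
    return np.stack(outputs)

# The two forms agree up to floating-point error:
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```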
The October 30, 2025 **technical report** details the development and evaluation of **Kimi Linear**, a novel **hybrid linear attention architecture** for large language models (LLMs). The core innovation is the **Kimi Delta Attention (KDA)** module, which refines existing linear attention mechanisms to achieve superior performance and efficiency compared to traditional full attention, particularly in **long-context scenarios**. Empirical results from extensive pretraining and fine-tuning experiments demonstrate that Kimi Linear **outperforms baselines** across various tasks, including general reasoning and code generation, while significantly reducing **memory usage** and increasing **decoding throughput**. The report also includes a **complexity analysis** and a detailed discussion of KDA's relationship to other efficient attention and state-space models.
Source:
https://arxiv.org/pdf/2510.26692
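For intuition about the kind of recurrence KDA refines, here is a highly simplified NumPy sketch of a gated delta-rule state update. The per-channel decay `alpha`, the scalar writing strength `beta`, and the overall structure are illustrative assumptions and do not reproduce the paper's actual parameterization, chunkwise computation, or hybrid layer layout.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One token step of a gated delta-rule recurrence (illustrative only).

    S     : (d_k, d_v) fixed-size state mapping keys to values
    alpha : (d_k,) per-channel forget gate in [0, 1] (fine-grained decay)
    beta  : scalar writing strength in [0, 1]
    """
    S = alpha[:, None] * S                  # channel-wise forgetting
    pred = k @ S                            # what the state currently recalls for k
    S = S + beta * np.outer(k, v - pred)    # delta-rule correction toward v
    out = q @ S                             # read out with the query
    return S, out

# Decoding touches only this constant-size state, never a growing KV cache:
# for q, k, v, alpha, beta in token_stream:
#     S, out = gated_delta_step(S, q, k, v, alpha, beta)
```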
The April 22, 2022 paper, a collaboration between the University of Washington, Facebook AI, and the Allen Institute for AI, introduces Attention with Linear Biases (ALiBi), a novel and efficient method for position representation in transformer models that addresses the challenge of **extrapolation**—a model's ability to maintain performance on input sequences longer than those used during training. The authors demonstrate that traditional position encoding methods, like sinusoidal embeddings, fail to extrapolate effectively, while alternatives like the T5 bias are computationally costly. **ALiBi improves extrapolation** by biasing query-key attention scores with a distance-proportional penalty, eliminating the need for positional embeddings entirely. This approach is shown to be **faster and more memory-efficient** than the baselines, enabling a 1.3-billion-parameter model trained on shorter sequences to achieve comparable or superior perplexity when evaluated on significantly longer sequences. The findings suggest that ALiBi's gains when extrapolating are primarily due to mitigating the "early token curse" common in sequence-splitting evaluation methods.
Source:
https://arxiv.org/pdf/2108.12409
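A minimal sketch of the ALiBi bias. The geometric head-slope schedule matches the paper's recipe for power-of-two head counts; the shapes and the way the bias is added to the scores are the usual convention rather than code from the paper.

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head additive bias: a linear penalty proportional to query-key distance."""
    # Geometric slope schedule, e.g. for 8 heads: 1/2, 1/4, ..., 1/256.
    slopes = np.array([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]                    # i - j for query i, key j
    bias = -slopes[:, None, None] * np.maximum(distance, 0)   # (heads, q_len, k_len)
    return bias

# Added to the raw attention scores before the causal mask and softmax:
# scores = q @ k.T / np.sqrt(d_head) + alibi_bias(seq_len, n_heads)
```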
The August 26, 2024 academic paper introduces **Quest**, a novel algorithm designed to improve the inference efficiency of **Long-Context Large Language Models (LLMs)** by addressing the costly self-attention process caused by a large Key-Value (KV) cache. Quest utilizes **Query-Aware Sparsity** to dynamically identify and select only the **critical KV cache pages** based on the current query token, which significantly reduces the required memory movement during decoding. Unlike previous **Query-Agnostic** methods that evict tokens based on past information, Quest maintains high accuracy by never fully discarding context and achieves substantial speedups in self-attention latency, demonstrating its effectiveness across various long-context tasks. The authors provide a detailed breakdown of the methodology and experimental results showing Quest's superior efficiency and accuracy compared to existing baselines.
Source:
https://arxiv.org/pdf/2406.10774
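A toy sketch of the query-aware page selection described above: each KV-cache page keeps element-wise min/max key metadata, the current query scores every page with an upper bound on its possible attention weight, and only the top pages are loaded for attention. Page size, shapes, and the plain top-k selection are simplified assumptions.

```python
import numpy as np

def select_critical_pages(q, key_pages, top_k):
    """key_pages: list of (page_len, d) arrays of cached keys for one head.
    Returns indices of the pages with the highest upper-bound attention scores."""
    scores = []
    for keys in key_pages:
        k_max, k_min = keys.max(axis=0), keys.min(axis=0)     # per-page metadata
        # Element-wise upper bound on q . k over any key in the page.
        upper = np.maximum(q * k_max, q * k_min).sum()
        scores.append(upper)
    return np.argsort(scores)[-top_k:]

# Attention is then computed only over the selected pages' keys and values,
# while unselected pages remain in the cache and may be chosen by later queries.
```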
The September 19, 2023 Alibaba paper introduces **Flash-LLM**, a novel software framework designed to enable **cost-effective and highly-efficient inference for large generative models** by supporting unstructured sparsity on high-performance tensor cores. The authors observe that the primary bottleneck in large language model (LLM) inference is the memory bandwidth limitation during "skinny" matrix multiplications, rather than the arithmetic processing of tensor cores. Flash-LLM addresses this through a **"Load-as-Sparse and Compute-as-Dense" methodology**, which minimizes global memory access by loading sparse data, then utilizes tensor cores efficiently by transforming that data to a dense format in on-chip memory. Extensive evaluations demonstrate that Flash-LLM significantly outperforms state-of-the-art libraries like Sputnik and SparTA at the kernel level and achieves substantial end-to-end throughput improvements and lower inference costs compared to frameworks like DeepSpeed and FasterTransformer on large OPT models. The paper also details the specialized techniques developed for the framework, including a **Tiled-CSL sparse format** and a two-level overlapping computation pipeline.
Source:
https://arxiv.org/pdf/2309.10285
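A conceptual, CPU-side NumPy sketch of the "Load-as-Sparse and Compute-as-Dense" idea: each weight tile is stored as nonzeros plus local offsets (standing in for the Tiled-CSL format), rebuilt into a dense tile in a small buffer, and then multiplied densely. The real framework does this in on-chip shared memory with tensor cores, which this sketch does not model; tile size and data layout here are assumptions.

```python
import numpy as np

TILE = 4  # illustrative tile size

def pack_tile(tile):
    """Keep only the nonzeros of a tile plus their flat offsets ("load as sparse")."""
    idx = np.flatnonzero(tile)
    return tile.ravel()[idx], idx

def unpack_tile(values, idx):
    """Rebuild the dense tile in a small buffer ("compute as dense")."""
    buf = np.zeros(TILE * TILE)
    buf[idx] = values
    return buf.reshape(TILE, TILE)

def sparse_weight_matmul(packed_tiles, x):
    """packed_tiles: 2D grid of (values, idx) pairs for an (R*TILE, C*TILE) weight."""
    rows = []
    for tile_row in packed_tiles:
        acc = np.zeros((TILE, x.shape[1]))
        for j, (vals, idx) in enumerate(tile_row):
            w_tile = unpack_tile(vals, idx)              # densify in the buffer
            acc += w_tile @ x[j * TILE:(j + 1) * TILE]   # dense multiply per tile
        rows.append(acc)
    return np.vstack(rows)
```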
The February 12, 2025 KuaiShou Inc paper introduces **ELASTIC**, an Efficient Linear Attention for SequenTial Interest Compression framework designed to address the **scalability issues** of traditional transformer-based sequential recommender systems, which suffer from quadratic complexity with respect to sequence length. ELASTIC achieves this by proposing a **Linear Dispatcher Attention (LDA) layer** that compresses long user behavior sequences into a more compact representation, leading to **linear time complexity** and significant reductions in GPU memory usage and increased inference speed. Furthermore, the framework incorporates an **Interest Memory Retrieval (IMR) technique** that uses a large, sparsely retrieved interest memory bank to expand the model's capacity and **maintain recommendation accuracy** despite the computational optimizations. Empirical results from experiments on datasets like ML-1M and XLong demonstrate that ELASTIC **outperforms baseline methods** while offering superior computational efficiency, especially when modeling long user sequences.
Source:
https://arxiv.org/pdf/2408.09380
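A rough sketch of the dispatcher idea described above: a small, fixed set of learnable dispatcher tokens first summarizes the long behavior sequence, and each position then attends only over those k summaries, giving O(N·k) rather than O(N²) cost. The two-stage softmax and the shapes here are illustrative assumptions, not the paper's exact LDA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dispatcher_attention(X, D):
    """X: (N, d) user-behavior sequence, D: (k, d) learnable dispatcher tokens, k << N."""
    # Stage 1: dispatchers gather a compressed summary of the full sequence, cost O(N*k).
    summary = softmax(D @ X.T) @ X
    # Stage 2: each position attends only over the k summaries, cost O(N*k).
    return softmax(X @ summary.T) @ summary
```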
On October 29, 2025 Anthropic presented research investigating the existence of **functional introspective awareness** in large language models (LLMs), specifically focusing on Anthropic's Claude models. The core methodology involves using **concept injection**, where researchers manipulate a model's internal activations with representations of specific concepts to see if the model can accurately **report on these altered internal states**. Experiments demonstrate that models can, at times, notice injected "thoughts," distinguish these internal representations from text inputs, detect when pre-filled outputs were unintentional by referring to prior intentions, and even **modulate their internal states** when instructed to "think about" a concept. The findings indicate that while this introspective capacity is often **unreliable and context-dependent**, the most capable models, such as Claude Opus 4 and 4.1, exhibit the strongest signs of this ability, suggesting it may emerge with increased model sophistication.
Source:
https://transformer-circuits.pub/2025/introspection/index.html
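As a rough illustration of the concept-injection methodology (not Anthropic's code or models), the sketch below adds a scaled "concept vector" to a chosen layer's residual-stream activations via a PyTorch forward hook. The layer path, the scale, and the way the concept vector is obtained are all assumptions for illustration.

```python
import torch

def inject_concept(model, layer, concept_vec, scale=4.0):
    """Register a hook that adds `scale * concept_vec` to the layer's output activations,
    so downstream layers (and the model's self-report) see the altered internal state."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * concept_vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Typical use: derive concept_vec as the difference of mean activations between prompts
# that do and don't involve the concept, inject it mid-stack, then ask the model whether
# it notices anything unusual about its "thoughts".
# handle = inject_concept(model, model.model.layers[20], concept_vec)  # hypothetical path
# ...generate and inspect the model's report...
# handle.remove()
```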
The October 24, 2025 multi-university collaboration compares the performance of **Large Language Models (LLMs)** and **Small Language Models (SLMs)** on requirements classification tasks within software engineering. Researchers conducted a preliminary study using eight models across three datasets to address concerns about the **high computational cost and privacy risks** associated with using proprietary LLMs. The results indicate that while LLMs achieved an average F1 score only 2% higher than SLMs, this difference was **not statistically significant**, suggesting that SLMs are a **valid and highly competitive alternative**. The study concludes that SLMs offer substantial benefits in terms of **privacy, cost efficiency, and local deployability**, and finds that dataset characteristics played a more significant role in performance than model size.
Source:
https://arxiv.org/pdf/2510.21443
The June 5, 2025 paper, a collaboration between the University of Edinburgh and NVIDIA, introduces the concept of **inference-time hyper-scaling** for large language models (LLMs), which aims to boost reasoning accuracy by allowing longer or more parallel token sequences within the same computational budget. The core bottleneck is identified as the size of the key–value (KV) cache, which grows linearly with sequence length and dominates inference cost. To address this, the authors propose **Dynamic Memory Sparsification (DMS)**, a novel, data-efficient method for compressing the KV cache by learning an adaptive token eviction policy with a **delayed eviction mechanism**. Experiments across various LLMs and reasoning tasks demonstrate that DMS significantly outperforms existing compression methods, effectively expanding the token budget and achieving superior accuracy at comparable runtime and memory loads.
Source:
https://arxiv.org/html/2506.05345v1
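A toy sketch of a delayed-eviction KV-cache policy in the spirit of what is described above: tokens flagged for eviction remain usable for a fixed number of further decoding steps before they are dropped. The learned eviction decision is replaced by a placeholder score threshold, and the exact delay mechanism is an assumption for illustration.

```python
from collections import deque

class DelayedEvictionCache:
    """KV cache where tokens marked for eviction stay usable for `delay` more steps."""

    def __init__(self, delay):
        self.delay = delay
        self.live = {}           # token_id -> (key, value)
        self.pending = deque()   # (evict_at_step, token_id), in arrival order
        self.step = 0

    def add(self, token_id, k, v, keep_score, threshold=0.5):
        self.live[token_id] = (k, v)
        if keep_score < threshold:                    # a learned policy would decide this
            self.pending.append((self.step + self.delay, token_id))

    def tick(self):
        """Advance one decoding step and drop tokens whose delay has expired."""
        self.step += 1
        while self.pending and self.pending[0][0] <= self.step:
            _, token_id = self.pending.popleft()
            self.live.pop(token_id, None)
```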
The October 21, 2025 collaboration between UW-Madison and Amazon Web Services discusses the critical role of the **Multi-Layer Perceptron (MLP) intermediate size f_size** as the primary architectural component for introducing non-linearity and complexity within Large Language Models (LLMs). The MLP layer achieves this by taking the hidden state of width d_model, projecting it up to the expanded f_size, applying a **non-linear gating function** (such as SwiGLU), and then projecting it back down. The balance between the MLP and the attention layers is governed by the **mlp-to-attention ratio r_mlp/attn**, which is essential for maximizing accuracy (by minimizing training loss) and optimizing inference efficiency (by boosting throughput). Extensive scaling-law analysis demonstrates that both the hidden size and r_mlp/attn exhibit a **U-shaped relationship with training loss**, confirming that careful tuning of these architectural parameters is necessary to achieve optimal model performance and inference speed.
Source:
https://arxiv.org/pdf/2510.18245
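To ground the terminology, here is a standard SwiGLU MLP block of the kind discussed above (PyTorch). `d_model` and `f_size` correspond to the hidden and intermediate widths; the bias-free layout is a common convention rather than something specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    def __init__(self, d_model, f_size):
        super().__init__()
        self.up = nn.Linear(d_model, f_size, bias=False)    # project up to f_size
        self.gate = nn.Linear(d_model, f_size, bias=False)  # gating branch
        self.down = nn.Linear(f_size, d_model, bias=False)  # project back down to d_model

    def forward(self, x):
        # Non-linear gating: SiLU(gate(x)) * up(x), then project back down.
        return self.down(F.silu(self.gate(x)) * self.up(x))

# The mlp-to-attention ratio r_mlp/attn compares the parameters/FLOPs spent in this block
# against those spent in the attention sub-layer of the same Transformer block.
```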
We cover two new innovations from Microsoft that extend ideas from the original **FlashAttention**. FlashAttention is an IO-aware attention algorithm for Transformers designed to address the quadratic time and memory complexity of standard self-attention on long sequences. By using **tiling and recomputation** to minimize slow **High Bandwidth Memory (HBM)** accesses in favor of fast **on-chip SRAM**, FlashAttention achieves significant wall-clock speedups for training models like BERT and GPT-2, enabling them to handle much longer context lengths. Microsoft's new **ATTENTION2D** builds upon memory-efficient methods like FlashAttention to optimize **distributed self-attention** across multiple GPUs, achieving parallelism in two dimensions (Q-DIM and KV-DIM) to overcome the communication bottleneck inherent in prior single-dimension parallel approaches like Ring Attention (a conceptual sketch of this 2-D decomposition follows the source list below). Microsoft's additional contribution is **Lean Attention**, which also appears to propose a high-performance, tiled execution strategy for attention, using shared memory and iterative computation, similar to the IO-aware concepts in the other sources.
Sources:
The original FlashAttention paper:
https://arxiv.org/pdf/2205.14135
The FlashAttention-2 paper:
https://arxiv.org/pdf/2307.08691
June 28, 2025 Microsoft's Attention2D:
https://arxiv.org/pdf/2503.15758
Microsoft's Lean attention:
https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/Lean_Attention___arxiv_version.pdf
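The sketch below illustrates the two-dimensional decomposition attributed to ATTENTION2D above: both the query blocks and the key/value blocks are split, each (Q tile, KV tile) pair is an independent unit of work, and partial results are reduced along the KV dimension. This is a single-process NumPy analogy (unscaled, non-causal, no communication layer), not the paper's distributed implementation.

```python
import numpy as np

def partial_block(q_blk, k_blk, v_blk):
    """One worker's job in the 2-D grid: attention stats for (one Q tile, one KV tile)."""
    s = q_blk @ k_blk.T
    m = s.max(axis=1, keepdims=True)                      # local row max
    p = np.exp(s - m)
    return m, p @ v_blk, p.sum(axis=1, keepdims=True)     # (max, numerator, denominator)

def reduce_kv_dim(partials):
    """Combine partials from all KV tiles of one Q tile (a reduction along the KV-DIM)."""
    m = np.max([m_j for m_j, _, _ in partials], axis=0)
    num = sum(np.exp(m_j - m) * n_j for m_j, n_j, _ in partials)
    den = sum(np.exp(m_j - m) * d_j for m_j, _, d_j in partials)
    return num / den

def attention_2d(Q, K, V, q_tiles=2, kv_tiles=2):
    q_blocks = np.array_split(Q, q_tiles)
    kv_blocks = list(zip(np.array_split(K, kv_tiles), np.array_split(V, kv_tiles)))
    out = [reduce_kv_dim([partial_block(qb, kb, vb) for kb, vb in kv_blocks])
           for qb in q_blocks]          # each inner list corresponds to one grid row
    return np.vstack(out)
```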
The provided text introduces **Sentence-BERT (SBERT)**, a modification of the popular **BERT** and **RoBERTa** language models, designed to efficiently generate **semantically meaningful sentence embeddings**. The authors address the significant **computational overhead** of using standard BERT for tasks requiring sentence-pair comparisons, such as semantic similarity search and clustering, which can take hours for large datasets. SBERT utilizes **siamese and triplet network structures** to create fixed-size sentence vectors that can be quickly compared using metrics like **cosine-similarity**, drastically reducing the computation time from hours to seconds while **maintaining or exceeding accuracy**. Evaluation results demonstrate that SBERT significantly **outperforms other state-of-the-art sentence embedding methods** on various Semantic Textual Similarity (STS) and transfer learning tasks. Ultimately, SBERT makes **BERT usable for large-scale applications** where the original architecture was too slow.
Source:
https://arxiv.org/pdf/1908.10084
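A minimal usage sketch with the sentence-transformers library that accompanies SBERT; the checkpoint name below is one common pretrained model, chosen here purely for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT-style checkpoint works

sentences = [
    "A man is eating food.",
    "Someone is having a meal.",
    "The stock market closed higher today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Fixed-size sentence vectors compared directly with cosine similarity -- no
# cross-encoding of every sentence pair through BERT is required.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```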
The source provides excerpts from a scientific paper introducing **TxGNN**, a novel graph foundation model designed for **zero-shot drug repurposing**, which aims to identify therapeutic candidates even for diseases with no existing treatments or limited molecular data. Developed by researchers affiliated with institutions like Harvard Medical School and Stanford University, this model leverages a **medical knowledge graph (KG)** and a graph neural network (GNN) to predict drug indications and contraindications across over 17,000 diseases, demonstrating significant performance improvements over existing methods. The paper highlights TxGNN’s ability to generate **multi-hop interpretable explanations** for its predictions, fostering trust and aiding human experts, and validates its clinical relevance by showing alignment with **off-label prescriptions** observed in electronic medical records (EMRs). Overall, the work presents a comprehensive AI framework to systemize and enhance drug repurposing, particularly for neglected or rare diseases.
Source:
https://pmc.ncbi.nlm.nih.gov/articles/PMC11645266/
This April 29, 2024 paper provides an overview of the challenges associated with using **NVIDIA's Multi-Instance GPU (MIG)** technology, specifically focusing on the address translation mechanism in the **A100 GPU**. The paper reveals, primarily through **reverse-engineering efforts**, that the L2 and L3 Translation Lookaside Buffers (**TLBs**) use a compression design in which each entry comprises **16 sub-entries** to improve memory capacity management. A major problem arises because the **L3 TLB is shared** across all isolated MIG instances, causing contention that results in frequent evictions and low utilization of these sub-entries. To mitigate this performance degradation, the authors propose **STAR**, a novel hardware solution that dynamically enables the sharing of TLB sub-entries among different base addresses to improve overall efficiency.
Source:
https://arxiv.org/pdf/2404.18361
The August 26, 2025 collaboration between Stanford, NVIDIA, Shanghai Jiao Tong University, University of Michigan, University of Colorado Boulder, and Carnegie Mellon University introduces **Strata**, a hierarchical context caching framework designed to improve the performance of serving Large Language Models (LLMs) with long context windows. The core problem Strata addresses is that while caching key-value (KV) states is essential for efficiency, transferring large, fragmented cached contexts from slower memory tiers (like CPU memory) back to the GPU creates **severe I/O bottlenecks and performance stalls**. The paper also explains why paged attention, although designed to reduce GPU memory fragmentation, introduces data fragmentation when large contexts are offloaded to slower memory tiers. Strata overcomes these issues through two main innovations: **GPU-assisted I/O** to mitigate data fragmentation and achieve high bandwidth utilization, and **cache-aware request scheduling** to intelligently form balanced batches and overlap unavoidable I/O stalls with complementary tasks. The evaluation shows that Strata significantly reduces the **Time-To-First-Token (TTFT)** and increases throughput compared to state-of-the-art serving systems like vLLM + LMCache and TensorRT-LLM on long-context benchmarks.
Source:
https://arxiv.org/html/2508.18572v1
This is a review of a classic, still important paper: the original FlashAttention paper. We revisit it in light of advances in compiler technology.
The June 23, 2022 Stanford paper describes the original **FlashAttention**, an innovative, IO-aware algorithm designed to significantly enhance the efficiency of the attention mechanism in Transformer models by optimizing memory usage and access. Standard attention suffers from complexity that scales **quadratically** ($O(N^2)$) with sequence length ($N$) for both memory footprint and access to slow High Bandwidth Memory (HBM), which creates a performance bottleneck. FlashAttention overcomes this by employing **tiling and recomputation** within a single customized CUDA kernel, dramatically reducing the memory footprint to scale **linearly** ($O(N)$) and eliminating the quadratic term in HBM access complexity. While the algorithm does not reduce the total Floating Point Operations (FLOPs) and even slightly increases them due to recomputation, the massive reduction in slow memory transfers results in substantial **wall-clock runtime speedups** during both training and inference.
Source:
https://arxiv.org/pdf/2205.14135
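To make the tiling-plus-recomputation idea concrete, here is a NumPy sketch of the forward pass's streaming "online softmax": keys and values are processed one tile at a time while a running max, normalizer, and output accumulator are maintained, so the full N×N score matrix is never materialized in slow memory. This is an unscaled, non-causal, single-head simplification, not the CUDA kernel.

```python
import numpy as np

def flash_attention_forward(Q, K, V, tile=64):
    """Streaming attention: O(N) extra memory instead of the O(N^2) score matrix."""
    N, d_v = Q.shape[0], V.shape[1]
    out = np.zeros((N, d_v))
    m = np.full((N, 1), -np.inf)   # running row max of the scores
    l = np.zeros((N, 1))           # running softmax normalizer

    for start in range(0, K.shape[0], tile):
        k_blk, v_blk = K[start:start + tile], V[start:start + tile]
        s = Q @ k_blk.T                                   # scores against this tile only
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)                    # rescale previous partials
        l = correction * l + p.sum(axis=1, keepdims=True)
        out = correction * out + p @ v_blk
        m = m_new
    return out / l
```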
On October 20, 2025 Hugging Face released **MTEB v2**, a significant refactoring of the Massive Text Embedding Benchmark, which was originally designed for evaluating text embedding models across various tasks like classification and retrieval. The update addresses **package bloating and the need for broader support** by introducing a **more consistent interface, better typing, and improved documentation**. Key new features include support for **multimodal evaluation (text, images, and audio)**, **unified retrieval and reranking tasks**, and an **easier evaluation process** using the new `mteb.evaluate` function and `ResultCache` for managing results. The article also provides detailed instructions for **upgrading from MTEB v1**, including how to convert old models and datasets to the new v2 format.
Source:
https://huggingface.co/blog/isaacchung/mteb-v2
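A hedged sketch of the new evaluation flow as described in the post; the exact signatures of `mteb.get_tasks` and `mteb.evaluate` should be checked against the v2 documentation, and the model and task names here are illustrative choices.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Illustrative model and task; any embedding model and benchmark task can be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])

# v2 replaces the old MTEB(...).run(model) pattern with a top-level evaluate call
# (call pattern paraphrased from the release post; consult the docs for exact arguments).
results = mteb.evaluate(model, tasks=tasks)
print(results)
```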
The October 10, 2025 paper from the University of Michigan and **Google DeepMind** examines the phenomenon of **"overthinking" in Large Language Models (LLMs)** that utilize chain-of-thought (**CoT**) reasoning. The authors introduce a systematic analyzer called **TRACE** to structurally examine an LLM's thought process, decomposing it into sub-thoughts and progression graphs to move beyond superficial, length-based metrics of overthinking. Benchmarking across various tasks reveals that "thinking models" often waste significant computational resources on simple queries without notable accuracy gains, operating **five to twenty times slower** than non-thinking counterparts. The study identifies two primary overthinking patterns—**Explorer** (characterized by over-exploration and backtracking) and **Late Landing** (marked by excessive self-verification)—and proposes a **utility-based redefinition of overthinking** focused on diminishing marginal returns of subsequent thoughts.
Source:
https://arxiv.org/pdf/2510.07880
The October 23, 2025 research paper **probes the spatial reasoning capabilities of Large Language Models (LLMs) when processing text-based inputs**, specifically focusing on how performance degrades as task complexity increases. Using a suite of five grid-based tasks—including quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding—the authors tested four models: GPT-4o, GPT-4.1, and two variants of Claude 3.7. The key finding is that while models achieve **moderate success on smaller grids**, their accuracy rapidly deteriorates as grid dimensions scale up, demonstrating a **significant gap between linguistic and robust spatial representation** in their architectures. Notably, the **Anthropic models consistently outperformed the OpenAI variants**, though all models exhibited weaknesses, such as frequent miscounting, mathematical errors, and difficulty maintaining board state in complex scenarios. The study concludes by emphasizing the **fragility of LLM spatial reasoning** at scale and suggesting future work on improving text-based spatial data representation and mathematical capabilities.
Source:
https://arxiv.org/pdf/2510.20198
This October 23, 2025 Xidian University academic survey systematically reviews the transformative impact of **Large Language Models (LLMs)** on the three core stages of **Knowledge Graph (KG) construction**: ontology engineering, knowledge extraction, and knowledge fusion. The text explains that LLMs are shifting the paradigm from rigid, rule-based systems to **unified, adaptive, and generative frameworks**. The paper is structured to first revisit traditional KG methodologies before examining emerging LLM-driven approaches, which are categorized into **schema-based** (emphasizing structure) and **schema-free** (emphasizing flexibility) paradigms across all stages. The authors outline how LLMs function as either **ontology assistants (top-down)** or as consumers of KGs for grounding and memory **(bottom-up)**, culminating in a discussion of future directions such as KG-based reasoning and dynamic knowledge memory for agentic systems. Ultimately, the work aims to clarify the evolving relationship between symbolic knowledge engineering and neural semantic understanding.
Source:
https://arxiv.org/pdf/2510.20345