This research paper investigates the impact of different large language models (LLMs) used as "teachers" to generate synthetic responses for instruction tuning. The authors demonstrate a surprising phenomenon they call the "Larger Models' Paradox," where larger and supposedly "stronger" teacher models do not always lead to improved instruction-following abilities in smaller base models. They propose a novel metric called Compatibility-Adjusted Reward (CAR) to better predict the effectiveness of teacher models, taking into account the compatibility between the teacher and the base model being fine-tuned. The study challenges the common assumption that larger LLMs are always better teachers and suggests that successful instruction tuning depends on matching the teacher to the base model.
This research paper investigates the potential for large language models (LLMs) to self-improve in long-context reasoning, which involves processing and understanding complex information spread across long stretches of text. The authors propose a novel approach called SEALONG that leverages the LLMs' ability to generate multiple outputs for a given question and then scores these outputs using a method called Minimum Bayes Risk (MBR). The MBR approach prioritizes outputs that align better with each other, thereby filtering out outputs that might be incorrect or hallucinatory. SEALONG then uses these high-scoring outputs for further training, either through supervised fine-tuning or preference optimization. The authors demonstrate through extensive experiments that SEALONG significantly improves the long-context reasoning performance of LLMs without requiring expert model annotations or human labeling.
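The consensus-based scoring described above can be illustrated with a small sketch. It scores each sampled output by its average similarity to the other samples and keeps the highest-scoring one. The token-overlap (Jaccard) similarity and the function names are assumptions chosen to keep the example dependency-free; they are not taken from the paper, which may use a different similarity measure.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mbr_scores(outputs):
    """Score each output by its mean similarity to all other outputs."""
    n = len(outputs)
    if n < 2:
        return [0.0] * n
    scores = []
    for i, out in enumerate(outputs):
        total = sum(jaccard(out, other)
                    for j, other in enumerate(outputs) if j != i)
        scores.append(total / (n - 1))
    return scores


def select_by_mbr(outputs):
    """Return the output that agrees most with the rest, plus all scores."""
    scores = mbr_scores(outputs)
    best = max(range(len(outputs)), key=scores.__getitem__)
    return outputs[best], scores


samples = ["the answer is paris", "answer is paris", "the answer is rome"]
best, scores = select_by_mbr(samples)
print(best, scores)
```

In a self-improvement loop of the kind summarized above, the top-scoring output could serve as a fine-tuning target, and a high-scoring and low-scoring pair could serve as a preference pair.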
The paper "LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models" introduces a new method for generating 3D models using large language models (LLMs). The authors address the challenge of tokenizing 3D mesh data for LLMs by representing the mesh data as plain text using the OBJ file format, a standard text-based format for 3D models. This approach allows for direct integration with LLMs without modifying the vocabulary or tokenizers, minimizing additional training overhead. The study then introduces LLaMA-Mesh, a fine-tuned LLaMA model that can generate 3D meshes from textual prompts, produce interleaved text and 3D mesh outputs, and understand and interpret 3D meshes. LLaMA-Mesh achieves mesh generation quality comparable to models trained from scratch while maintaining strong text generation abilities, demonstrating the potential for LLMs to become universal generative tools for multiple modalities.
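To make the plain-text representation concrete, the sketch below serializes a small mesh into OBJ vertex (`v`) and face (`f`) records and parses it back. The helper names are hypothetical, and the sketch covers only these two record types; any preprocessing the paper applies to coordinates before training is not shown.

```python
def mesh_to_obj(vertices, faces):
    """Serialize a mesh as OBJ text. OBJ face indices are 1-based."""
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines)


def obj_to_mesh(text):
    """Parse `v` and `f` records back into 0-based vertex and face lists."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices.append(tuple(float(p) for p in parts[1:4]))
        elif parts[0] == "f":
            # Face entries may look like "3/1/2"; keep only the vertex index.
            faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:]))
    return vertices, faces


# A tetrahedron: four vertices, four triangular faces.
tetra_vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
tetra_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
obj_text = mesh_to_obj(tetra_vertices, tetra_faces)
print(obj_text)
```

Because the result is ordinary text, it can be passed through an existing tokenizer unchanged, which is the property the summary highlights.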
The researchers introduce LLaVA-o1, a vision language model designed to perform structured reasoning by breaking down problem-solving into four distinct stages: summary, caption, reasoning, and conclusion. They compiled a new dataset, LLaVA-o1-100k, and proposed a stage-level beam search method to improve model performance during inference. Experimental results demonstrate that LLaVA-o1 outperforms existing open-source and even some closed-source models on multimodal reasoning benchmarks, emphasizing the effectiveness of its structured reasoning approach.
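A stage-level search of this kind can be sketched as follows: at each of the four stages, several candidates are sampled, one is kept, and generation continues from it. The `sample_stage` and `score` callables are hypothetical placeholders for a model call and a selection criterion; the paper's actual selection procedure is not specified in this summary, so picking the maximum under a scoring function is an assumption.

```python
STAGES = ["summary", "caption", "reasoning", "conclusion"]


def stage_level_search(question, sample_stage, score, n_candidates=4):
    """Keep the best of n sampled candidates at every stage."""
    context = []  # list of (stage, chosen_text) pairs built up so far
    for stage in STAGES:
        candidates = [sample_stage(question, context, stage, i)
                      for i in range(n_candidates)]
        best = max(candidates,
                   key=lambda c: score(question, context, stage, c))
        context.append((stage, best))
    return context


# Toy stand-ins so the sketch runs without a model.
def toy_sample(question, context, stage, i):
    return f"{stage}-draft-{i}"


def toy_score(question, context, stage, candidate):
    return 1.0 if candidate.endswith("2") else 0.0


result = stage_level_search("What is in the image?", toy_sample, toy_score)
print(result)
```

Selecting per stage, rather than once per full response, lets a weak early stage be discarded before later stages build on it.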
This paper introduces BlueLM-V-3B, a multimodal large language model (MLLM) designed specifically for mobile devices. The researchers address the challenges of deploying large models on mobile phones, such as limited memory and processing power, by implementing a novel algorithm and system co-design approach. This includes a dynamic resolution scheme that optimizes image processing and a token downsampler that reduces the number of image tokens to improve inference speed. The paper emphasizes BlueLM-V-3B's superior performance compared to other models of similar size and its high deployment efficiency on mobile devices.
This paper introduces CORAL, a novel benchmark dataset for evaluating Retrieval-Augmented Generation (RAG) systems in a multi-turn conversational setting. The authors highlight the limitations of existing datasets in assessing conversational RAG and detail CORAL's unique features, including open-domain coverage, knowledge intensity, free-form responses, topic shifts, and citation labeling. They explain how CORAL is derived from Wikipedia by automatically converting its content into conversational formats, and outline the three core tasks it supports: conversational passage retrieval, response generation, and citation labeling. The authors present a unified framework for evaluating conversational RAG methods and report on experiments conducted on CORAL, showcasing the performance of different conversational search and generation models.
This research paper surveys small language models (SLMs) and explores their applications, design, training, and model compression techniques. The authors explain that while large language models (LLMs) have proven effective, their resource demands have led to the development of SLMs, which are more efficient and can be deployed on a wider range of devices. The paper examines various techniques to optimize SLMs, including lightweight model architectures, efficient self-attention mechanisms, and model compression strategies such as pruning, quantization, and knowledge distillation. The authors discuss the challenges associated with SLMs, such as hallucination, bias, and energy consumption, and offer suggestions for future research. The goal of this work is to provide a comprehensive resource for researchers and practitioners working with small language models.
This research explores whether transformers, a type of neural network architecture, can learn to reason implicitly over knowledge. The authors find that transformers can learn to reason implicitly, but only through a phenomenon called grokking, where training extends far beyond overfitting. The study investigates two reasoning types: composition and comparison. They find that while the transformers generalize well on in-distribution examples for both types, they struggle with out-of-distribution generalization for composition but succeed for comparison. Through mechanistic analysis of the model’s internals, they discover that different circuits are formed during grokking for each reasoning type, which explains the varying levels of systematicity. The authors also demonstrate the potential of parametric memory for complex reasoning tasks with large search spaces, showing that a fully grokked transformer can achieve near-perfect accuracy, while state-of-the-art LLMs with non-parametric memory fail.
This research paper details the development of Llama 3, a large language model with improved capabilities in language understanding, code generation, mathematical reasoning, and multimodality. The paper emphasizes the importance of high-quality data, scaling up compute power, and using simple, efficient methods to achieve optimal results. It also explores the integration of vision and speech capabilities into Llama 3, highlighting the benefits of a compositional approach. The paper concludes with a discussion of safety measures implemented in Llama 3 to mitigate potential risks and ensure responsible use of the model.
This text introduces Kolmogorov-Arnold Networks (KANs). Unlike traditional Multi-Layer Perceptrons (MLPs), which have fixed activation functions on nodes, KANs have learnable activation functions on edges. This seemingly simple change allows KANs to outperform MLPs in terms of accuracy and interpretability, particularly for small-scale artificial intelligence and scientific tasks. The text explores the mathematical foundations of KANs, highlighting their ability to overcome the curse of dimensionality and achieve faster neural scaling laws than MLPs. Additionally, the text showcases KANs' potential for scientific discovery by demonstrating their effectiveness in uncovering mathematical relations in knot theory and identifying phase transition boundaries in condensed matter physics.
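The difference between activations on nodes and activations on edges can be shown with a minimal sketch. Each edge below carries its own univariate function, expressed as a weighted sum of fixed basis functions whose weights would be the learnable parameters; each output node only sums its incoming edges. The Gaussian basis is an assumption made for brevity (it stands in for whatever spline-style parameterization the text uses), and the class name is hypothetical.

```python
import math


class KANLayer:
    """One layer with a separate learnable 1-D function on every edge."""

    def __init__(self, in_dim, out_dim, num_basis=5, lo=-1.0, hi=1.0):
        # Evenly spaced basis centers on [lo, hi]; num_basis must be >= 2.
        self.centers = [lo + (hi - lo) * k / (num_basis - 1)
                        for k in range(num_basis)]
        self.width = (hi - lo) / (num_basis - 1)
        # coef[j][i][k]: weight of basis k on the edge from input i to output j.
        self.coef = [[[0.0] * num_basis for _ in range(in_dim)]
                     for _ in range(out_dim)]

    def edge_fn(self, j, i, x):
        """Evaluate the function attached to edge (input i -> output j)."""
        return sum(c * math.exp(-((x - m) / self.width) ** 2)
                   for c, m in zip(self.coef[j][i], self.centers))

    def forward(self, xs):
        """Each output node simply sums the values of its incoming edges."""
        return [sum(self.edge_fn(j, i, x) for i, x in enumerate(xs))
                for j in range(len(self.coef))]


layer = KANLayer(in_dim=2, out_dim=3)
layer.coef[0][0][2] = 1.0  # shape the function on a single edge
print(layer.forward([0.0, 0.3]))
```

Because every edge function is one-dimensional, it can be plotted and inspected on its own, which is one way to read the interpretability claim above.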
The document describes the development of MMIE, a large-scale benchmark designed to evaluate the performance of Large Vision-Language Models (LVLMs) in interleaved multimodal comprehension and generation tasks. MMIE comprises a dataset of 20,000 meticulously curated multimodal queries across various domains, including mathematics, coding, and literature, which are designed to challenge LVLMs to produce and interpret both images and text in arbitrary sequences. The authors also propose a reliable automated evaluation metric for MMIE, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria. Extensive experiments demonstrate the effectiveness of the benchmark and metrics, revealing significant room for improvement in the development of interleaved LVLMs. The paper provides detailed insights into the benchmark's construction, evaluation methods, and error analysis, offering valuable guidance for future research in multimodal learning.
This research paper proposes a novel method called Thought Preference Optimization (TPO) to train large language models (LLMs) to "think" before responding to user instructions. TPO utilizes a preference-based training framework where LLMs generate internal thoughts alongside their responses, and these thoughts are then optimized based on the quality of the resulting responses. The authors argue that this approach, unlike previous methods relying on direct supervision, allows LLMs to develop thinking abilities for a broader range of tasks beyond traditional reasoning and problem-solving. They demonstrate the effectiveness of TPO on benchmark datasets and observe that LLMs trained with TPO show improvements even in non-reasoning categories like language and translation, marketing, and health, highlighting the potential for thinking-based LLMs in diverse applications.
The paper "VIT-LENS: Towards Omni-modal Representations" introduces a novel approach to enable Artificial Intelligence (AI) agents to perceive information from various modalities beyond just vision and language. It proposes a method that leverages a pretrained Vision Transformer (ViT) to efficiently encode information from diverse modalities, such as 3D point clouds, depth, audio, tactile, and electroencephalograms (EEG). By aligning these modalities to a shared embedding space, VIT-LENS unlocks a range of capabilities for AI agents, including any-modality captioning, question answering, and image generation. The paper presents extensive experimental results demonstrating that VIT-LENS achieves state-of-the-art performance on various benchmark datasets and outperforms prior methods in understanding and interacting with diverse modalities.
This research paper proposes a new method for efficiently training linear transformers, which are a type of neural network that uses linear attention to process sequences of data. Unlike traditional transformers, which have quadratic complexity in sequence length, linear transformers can process long sequences in linear time, making them more efficient for certain tasks. However, existing linear transformers have been shown to struggle with tasks that require long-range dependencies or the ability to retrieve information from a large context. The authors address this limitation by introducing a novel algorithm called DeltaNet, which utilizes a delta rule-like update to improve associative recall over long contexts. DeltaNet is parallelized across sequence length using a memory-efficient representation for computing products of Householder matrices, making it suitable for training on modern hardware. The authors demonstrate that DeltaNet outperforms other linear-time baselines, particularly on recall-intensive tasks, and that DeltaNet can also be effectively combined with other types of attention mechanisms to create hybrid models that achieve even better performance.
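The delta rule-like update mentioned above can be sketched in its simple sequential form. The memory is a matrix `S`; writing a key-value pair first reads what the memory currently returns for that key, then corrects the memory by the error, scaled by a write strength `beta`. This sketch shows only the recurrence, under the assumption that the update is `S <- S + beta * (v - S k) k^T`; the parallel, Householder-based training algorithm described in the summary is not reproduced here, and the function names are hypothetical.

```python
def read(S, q):
    """Retrieve the value associated with query q: the product S q."""
    return [sum(row[c] * q[c] for c in range(len(q))) for row in S]


def delta_rule_step(S, k, v, beta):
    """One delta-rule write: move the stored value for key k toward v."""
    pred = read(S, k)  # value the memory currently returns for this key
    return [[S[r][c] + beta * (v[r] - pred[r]) * k[c]
             for c in range(len(k))]
            for r in range(len(S))]


# A 2x2 memory, initially empty.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

S = delta_rule_step(S, key, [3.0, 4.0], beta=1.0)
print(read(S, key))  # the stored value for this key

# Writing a new value under the same key replaces the old one
# instead of adding to it.
S = delta_rule_step(S, key, [5.0, 6.0], beta=1.0)
print(read(S, key))
```

A purely additive linear-attention update would accumulate both values under the same key; the error-correcting form overwrites the old association, which is the intuition behind better associative recall.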
This research explores how the architecture of pre-trained language models influences their base capabilities, specifically focusing on the FFN-Wider Transformer architecture. The study identifies a key factor in model performance: the contribution ratio of the Multi-Head Attention (MHA) layer, which acts as a combination function that reflects the model's ability to combine linguistic features. The authors demonstrate that FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. To address this issue, they propose a Combination Enhanced Architecture (CEA) that redistributes the wider FFN layer, enhancing the combination function and ultimately improving base capabilities. The effectiveness of CEA is further validated by its successful application to Mixture of Experts (MoE) Transformers, highlighting its potential for broader architecture improvement.
This research paper proposes a new method called Reinforcement Learning from Execution Feedback (RLEF) to improve the ability of large language models (LLMs) to generate code that successfully completes tasks. The authors demonstrate the effectiveness of RLEF by training LLMs on a challenging competitive programming benchmark called CodeContests. RLEF trains the models to iteratively generate code based on the feedback received from running their code against test cases. The results show that RLEF significantly improves solve rates and reduces the number of code samples needed compared to previous approaches, achieving state-of-the-art performance. The paper also investigates the inference-time behavior of RLEF-trained LLMs, highlighting their ability to effectively learn from feedback and make targeted improvements over multiple code generations.
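The iterative generate, execute, and revise behavior described above can be sketched as a plain loop. This shows only the inference-time interaction, not the reinforcement learning training itself. The `generate` callable is a hypothetical stand-in for a model, and the feedback message format is an assumption. Note that `exec` runs arbitrary code, so a real system would need sandboxing.

```python
def run_tests(source, tests, fn_name="solve"):
    """Execute candidate code and report the first failing test, if any."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; fine for a toy
        fn = namespace[fn_name]
        for args, expected in tests:
            got = fn(*args)
            if got != expected:
                return False, f"solve{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"error: {exc!r}"
    return True, "all public tests passed"


def feedback_loop(generate, tests, max_turns=3):
    """Ask for code, run it, and feed the result back until it passes."""
    history = []  # list of (source, feedback_message) pairs
    for _ in range(max_turns):
        source = generate(history)
        ok, message = run_tests(source, tests)
        history.append((source, message))
        if ok:
            break
    return history


# Toy "model": a buggy first attempt, then a fix once feedback exists.
def toy_generate(history):
    if not history:
        return "def solve(a, b):\n    return a - b\n"
    return "def solve(a, b):\n    return a + b\n"


tests = [((1, 2), 3), ((5, 5), 10)]
history = feedback_loop(toy_generate, tests)
for source, message in history:
    print(message)
```

In a training setup of the kind summarized above, the final pass or fail outcome of such an episode would be the natural source of a reward signal.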
This research paper investigates how variations in the phrasing of prompts impact the performance of large language models (LLMs) across 120 tasks and five models. The study systematically analyzes six families of paraphrase types, including morphology, syntax, lexicon, lexico-syntax, discourse, and others, to determine their influence on model outputs. The findings demonstrate a potential for significant performance gains when prompts are adapted using specific paraphrase types, particularly morphology and lexicon changes. The research also considers factors like prompt complexity, temperature, and proximity to training data, concluding that smaller models are more sensitive to paraphrase changes and can potentially achieve comparable performance to larger models through prompt engineering.
This paper introduces LLaMA-Omni, a model designed to improve speech interaction with large language models (LLMs). The model integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder, allowing it to generate text and speech responses directly from speech instructions with minimal latency. To enhance the model's performance, the authors create a speech instruction dataset called InstructS2S-200K containing 200,000 speech instructions and corresponding speech responses. Experimental results demonstrate that LLaMA-Omni provides superior responses in both content and style compared to previous speech-language models, achieving a response latency of 226 milliseconds. Furthermore, the model's training process is efficient, requiring less than 3 days on 4 GPUs.
"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models," details the development of a new family of multimodal language models (VLMs) called Molmo. Molmo is notable for its open-weight and open-data approach, meaning the model's weights, training data, and code are publicly available. This contrasts with the current trend of proprietary VLMs which keep their models closed. Molmo achieves state-of-the-art performance by utilizing a novel image captioning dataset called PixMo, collected from human annotators using speech-based descriptions. This approach avoids reliance on synthetic data generated by proprietary systems, enabling the creation of performant VLMs without the need for distilling closed models. The authors highlight Molmo's potential for various tasks, including question answering and image-based navigation.
This technical paper proposes a novel technique called Low-Rank Adaptation (LoRA) for adapting large language models (LLMs) to specific downstream tasks. LoRA addresses the cost of full fine-tuning, which updates all model parameters, by freezing the pretrained weights and injecting trainable low-rank decomposition matrices into each layer of the Transformer architecture. This significantly reduces the number of trainable parameters, resulting in a substantial decrease in storage requirements, memory usage, and training time. The paper shows that LoRA performs comparably to, or even better than, full fine-tuning on various tasks, including natural language understanding (NLU) and generation (NLG), while providing additional benefits such as efficient task switching and a lower hardware barrier to entry. The paper concludes by investigating the low-rank structure of model updates, providing insights into the effectiveness of LoRA and the underlying mechanisms of model adaptation.
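The low-rank injection can be illustrated with a dependency-free sketch. A frozen weight matrix `W` is left untouched, and the layer adds a correction `B A x`, where `A` and `B` have a small inner dimension `rank`. Initializing `B` to zero, so the adapted layer starts out identical to the original, and scaling the update by `alpha / rank` follow common LoRA practice but are assumptions with respect to this summary; the class and helper names are hypothetical.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """A frozen linear map W plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pretrained weights, never updated
        # A: rank x d_in, small random values; B: d_out x rank, zeros.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([1.0, 1.0, 1.0, 1.0]))
print(layer.trainable_params(), "trainable vs", len(W) * len(W[0]), "frozen")
```

For a `d_out x d_in` layer the adapter has `rank * (d_in + d_out)` trainable values instead of `d_in * d_out`, and swapping tasks only requires swapping `A` and `B`, which is the task-switching benefit mentioned above.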