
The source is a research paper that systematically examines multi-token prediction (MTP) capabilities in large language models (LLMs) originally trained for next-token prediction (NTP). The authors show that such LLMs already possess MTP ability via numerical marginalization over intermediate tokens, an ability that improves with model scale, but they note that this procedure is computationally expensive, since exact marginalization requires summing over all possible intermediate token sequences. The study then explores the challenge of adapting frozen LLMs for MTP by attaching prediction heads, and finds that the models' hidden layers are heavily specialized for NTP, which complicates adaptation. Ultimately, the researchers demonstrate that jointly training the LLM backbone and the MTP heads improves performance, yet a significant gap remains relative to the marginalization baseline, suggesting that further investigation is needed to overcome this specialization barrier.
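
To make the marginalization idea concrete, here is a minimal sketch in PyTorch, assuming a hypothetical `next_token_logprobs(prefix_ids)` callable that returns log p(next token | prefix) from any next-token LM. The top-k truncation of the intermediate token is an assumption made purely for tractability and is not necessarily the paper's exact procedure.

```python
import torch

def two_step_marginal(next_token_logprobs, prefix_ids, top_k=50):
    """Approximate p(x_{t+2} | prefix) by marginalizing over candidate x_{t+1}.

    Exact marginalization sums over the full vocabulary at the intermediate
    position; restricting x_{t+1} to its top-k candidates keeps the cost at
    k extra forward passes (a tractability assumption, not the paper's protocol).
    """
    log_p_next = next_token_logprobs(prefix_ids)        # log p(x_{t+1} | prefix), shape (V,)
    top_logp, top_ids = torch.topk(log_p_next, top_k)   # keep the k most likely x_{t+1}

    log_terms = []
    for lp, tok in zip(top_logp, top_ids):
        extended = torch.cat([prefix_ids, tok.view(1)])
        # log p(x_{t+1}=tok | prefix) + log p(x_{t+2} | prefix, tok), shape (V,)
        log_terms.append(lp + next_token_logprobs(extended))

    # log-sum-exp over candidate x_{t+1} values yields the approximate marginal
    return torch.logsumexp(torch.stack(log_terms), dim=0)
```

The same recursion extends to tokens further ahead, but the number of forward passes grows multiplicatively with each additional step, which is why the paper treats marginalization as a strong but costly baseline rather than a practical decoding strategy.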
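
For the adaptation setting, the following is a hedged sketch of what attaching a prediction head to a frozen backbone could look like. The single linear layer, the name `MTPHead`, and the optimizer choice are illustrative assumptions, not the paper's architecture or training recipe.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Illustrative head predicting the token k steps ahead from frozen hidden states."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the frozen backbone
        return self.proj(hidden_states)  # logits for x_{t+k}

# Sketch of the frozen-backbone setup: only the head receives gradients.
# backbone.requires_grad_(False)                       # hypothetical next-token LLM
# head = MTPHead(hidden_size=4096, vocab_size=32000)   # sizes are placeholders
# optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
```

In the joint-training variant described by the paper, the backbone parameters would also be unfrozen and optimized together with the heads, which narrows but does not close the gap to the marginalization baseline.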