
The document introduces STEER-ME, a new benchmark for assessing the microeconomic reasoning abilities of Large Language Models (LLMs) in non-strategic settings such as supply and demand analysis. To address the limitations of existing benchmarks, the researchers taxonomize microeconomic reasoning into 58 distinct elements covering areas such as consumption decisions, production decisions, and market equilibrium. The benchmark uses a novel automated data generation protocol, auto-STEER, to create a large and varied set of multiple-choice questions, which mitigates the risk of LLMs overfitting to evaluation data. A case study of 27 LLMs showed significant variation in performance: even sophisticated models often rely on shortcuts or produce "near-miss" solutions when faced with complex computational or conceptual tasks.
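
To make the idea of generated multiple-choice questions with "near-miss" answers concrete, the following is a minimal illustrative sketch. It is not the auto-STEER protocol, whose details are not given here. It assumes a linear supply and demand setting, and every name in it (`make_equilibrium_question`, the dictionary keys, the particular distractor rules) is hypothetical. Each question is drawn from random parameters, so no fixed question text exists for a model to memorize, and each wrong option mimics a plausible slip rather than a random number.

```python
import random
from fractions import Fraction


def make_equilibrium_question(rng):
    """Build one multiple-choice question on linear market equilibrium.

    Hypothetical illustration only. Demand is Q = a - b*P and supply is
    Q = c + d*P; the equilibrium price is P* = (a - c) / (b + d).
    """
    b = rng.randint(1, 5)        # demand slope
    d = rng.randint(1, 5)        # supply slope
    c = rng.randint(1, 30)       # supply intercept (nonzero by construction)
    p_star = rng.randint(2, 20)  # pick the answer first...
    a = c + (b + d) * p_star     # ...then back out the demand intercept

    correct = Fraction(p_star)
    # "Near-miss" distractors: each one mimics a plausible mistake.
    candidates = [
        Fraction(c + d * p_star),  # reports equilibrium quantity, not price
        Fraction(a + c, b + d),    # sign error on the supply intercept
        Fraction(a - c, b),        # forgets the supply slope
    ]
    distractors = []
    for value in candidates:
        if value != correct and value not in distractors:
            distractors.append(value)
    # Pad with nearby values if two slips happen to coincide.
    k = 1
    while len(distractors) < 3:
        filler = correct + k
        if filler not in distractors:
            distractors.append(filler)
        k += 1

    options = [correct] + distractors
    rng.shuffle(options)
    return {
        "question": (
            f"Demand is Q = {a} - {b}P and supply is Q = {c} + {d}P. "
            "What is the equilibrium price?"
        ),
        "options": options,
        "answer_index": options.index(correct),
        "params": (a, b, c, d),
    }


if __name__ == "__main__":
    q = make_equilibrium_question(random.Random(0))
    print(q["question"])
    for label, option in zip("ABCD", q["options"]):
        print(f"  {label}. {option}")
```

Passing a seeded `random.Random` keeps a generated question set reproducible, while changing the seed yields a fresh set with the same structure.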