
This paper investigates data leakage when machine learning (ML) is applied to panel data, which combine cross-sectional and time-series observations. The authors explain that standard ML practices that ignore the structure of panel data can cause temporal leakage (future information informing predictions about the past) and cross-sectional leakage (information shared between training and testing units). Such leakage inflates measured model performance and can lead to misleading policy recommendations, as the empirical applications demonstrate, in particular income prediction for U.S. counties. To counter this, the paper offers practical guidelines for practitioners: define the research goal clearly, whether cross-sectional prediction or sequential forecasting, and choose data-splitting and cross-validation strategies that match it, so that model evaluation is robust and realistic.
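
The two splitting strategies can be illustrated with a minimal sketch. It is not the authors' procedure, which this summary does not spell out. The toy panel, the field names `county` and `year`, and the helper functions `split_for_forecasting` and `split_for_cross_section` are all hypothetical, chosen only to show one plausible way of matching the split to the research goal.

```python
from itertools import product

# Hypothetical toy panel: every (county, year) pair is one observation.
counties = ["A", "B", "C", "D"]
years = [2015, 2016, 2017, 2018, 2019]
panel = [{"county": c, "year": y} for c, y in product(counties, years)]


def split_for_forecasting(rows, cutoff_year):
    """Sequential forecasting: train on years up to the cutoff, test on later years."""
    train = [r for r in rows if r["year"] <= cutoff_year]
    test = [r for r in rows if r["year"] > cutoff_year]
    return train, test


def split_for_cross_section(rows, test_units):
    """Cross-sectional prediction: hold out whole units with all of their years."""
    held_out = set(test_units)
    train = [r for r in rows if r["county"] not in held_out]
    test = [r for r in rows if r["county"] in held_out]
    return train, test


f_train, f_test = split_for_forecasting(panel, cutoff_year=2017)
c_train, c_test = split_for_cross_section(panel, test_units=["D"])

# Temporal check: no training observation is dated after the earliest test one.
assert max(r["year"] for r in f_train) < min(r["year"] for r in f_test)
# Cross-sectional check: no county appears on both sides of the split.
assert {r["county"] for r in c_train}.isdisjoint({r["county"] for r in c_test})
```

A purely random row-level split of the same panel would typically fail both checks, because it places later years of a county in training and earlier years of the same county in testing.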