
As AI continues to evolve, it is more urgent than ever to confront one of the biggest issues that accompanies, and in fact enables, AI development: the question of the labor behind AI. Our host Carter Considine digs into this issue.
At NeurIPS 2024, OpenAI cofounder Ilya Sutskever declared that AI has reached “peak data,” signaling the end of easily accessible datasets for pretraining models. As the industry hits data limits, attention is shifting back to supervised learning, which requires human-curated, labeled data to train AI systems.
Data labeling is a crucial part of AI development, but it’s also a deeply undervalued task. Workers in low-income countries like the Philippines, Kenya, and Venezuela are paid pennies for tasks such as annotating images, moderating text, or ranking outputs from AI models. Despite the massive valuations of companies like Scale AI, many of these workers face poor pay, delayed wages, and lack of transparency from employers.
Carter also discusses the explosive demand for labeled data, driven by techniques like Reinforcement Learning from Human Feedback (RLHF), which fine-tunes generative AI models like ChatGPT. While these fine-tuning techniques are crucial for improving AI’s accuracy, they rely heavily on human labor, often performed under exploitative conditions.
It’s worth repeating: we’re going to have to reckon with the disconnect between the immense profits generated by AI companies and the meager earnings of those who do the essential labeling work.
Synthetic data is often proposed as a solution to the data scarcity problem, but it’s not a perfect fix. Research shows that synthetic data can’t fully replace human-labeled datasets, especially when it comes to handling edge cases.
It’s time to propose ethical reforms in AI development. If we want this technology to continue to evolve at a sustainable pace, we must do what it takes to ensure fair pay, better working conditions, and greater transparency for the workers who make it all possible.
Key Topics:
- “Peak data” and the limits of easily accessible pretraining datasets
- The undervalued human labor behind data labeling
- RLHF and the growing demand for human-labeled data
- Why synthetic data can’t fully replace human annotation
- Ethical reforms: fair pay, better conditions, and transparency
More info, transcripts, and references can be found at ethical.fm