Is AI on the wrong path?
Adrien Foucart, PhD in biomedical engineering.
This website is guaranteed 100% human-written, ad-free and tracker-free. Follow updates using the RSS Feed or by following me on Mastodon
A current trend in deep learning has been bothering me (and many other people, to be fair…). Deep learning has always been about creating bigger, more complex models, trained on more data, to get better results. But it seems to me that the trend is getting worse, with potentially big consequences for the future of AI research. It may just be in my head, but it seems pretty clear from where I’ve been standing in the field. Which is: slightly on the sidelines, as a researcher who works with deep learning, who regularly writes about deep learning, but who doesn’t really work directly on deep learning architectures.
I started my PhD back in 2015, right when TensorFlow was released and deep learning was just seemingly becoming accessible to all. State-of-the-art models could generally be trained from scratch in a day or so, on a (high-end) consumer-grade GPU. The combination of “Big Data” and powerful GPUs had fueled the shift to the deep learning paradigm, so it was natural that a lot of research came into pushing the boundaries of how large, how deep, how complex we could make those models.
But my impression at the time was that there was also a strong concern for efficiency, and for keeping the complexity as low as possible [1, 2]. This has always been, for me, a key design rule for a good model: it needs to be just complex enough to accurately represent the complexity of the distributions in the data. Moving beyond that complexity inevitably leads to overfitting. Another option for avoiding overfitting, of course, is to add more data. And today’s dominant trend seems to be: let’s be as complex as we can, and use pre-training on very large generic datasets (ImageNet, COCO, ADE20K, Cityscapes, Pascal VOC...) to get the required amount of data, before fine-tuning on whatever task we are actually trying to solve (if there is even one: it’s often just about beating the benchmark).
I think that’s a worrying trend, for several reasons.
The first is: I’m not convinced it really works that well. It certainly does on tasks that require an extremely diverse set of features. Image generation and large language models are good examples. Stable Diffusion, DALL-E, and all of the very impressive demos of huge transformer-based models are amazing at their task, but it sometimes seems that their main task is to make transformer-based models look amazing. Maybe that’s just because I generally work in a field with highly specialized tasks, such as “mitosis counting” or “tumour grading”, where it’s a lot more useful to combine very reliable models focused on very narrow, well-defined tasks than to have one model that is moderately successful at many different tasks. Transformers and pre-trained networks are widely used there as well, but they haven’t really demonstrated a clear benefit over well-designed, targeted models.
The second reason I don’t like this trend is that we are centralizing AI research into the hands of a few key players. These very large, very complex models require a huge amount of resources, way beyond what’s possible with consumer-grade GPUs. That means that training those models is typically done on large clusters, powered by Amazon’s AWS, Microsoft’s Azure, or Google Cloud. Widespread use of transformers [3] (or vision transformers, etc.) brings more clients to the big cloud providers, who are, at the same time, driving a lot of the current research and setting the state-of-the-art. Transformers were introduced by a Google team. DALL-E is from OpenAI, which has close ties to Microsoft. Should we see a conspiracy by Google and Microsoft to drive AI research towards paradigms that allow them to centralize all AIs in their clouds? I don’t think it’s as intentional as that (and certainly some Google researchers like François Chollet are speaking out against the waste of resources [4]). What I do think is that the focus on benchmarks such as ImageNet as a measure of the state-of-the-art has falsely given the impression that super-large models were the only way forward, and that there is no incentive for Google or Microsoft (or even Meta, who has plenty of computational resources available to them) to look much further than that.
The extent to which some folks in deep learning research waste entire datacenter-months of computational resources to produce nothing but hot air is just so sad. It's like watching a macaque smear expensive oil paint on thousands of pristine linen canvases.
François Chollet [@fchollet on Twitter]
The third and final reason is becoming even more obvious in the face of climate change and the current energy crisis: this is all very wasteful. I cannot see how using hundreds of thousands of GPU hours to find just the right hyperparameters that make a model gain a fraction of a percent on some general-purpose benchmark is a good use of our limited supplies of energy.
I understand the temptation to go big. Given what deep learning algorithms can do, it sometimes seems like “true” AI is right around the corner. If we just give it a bit more power, a bit more depth, a few billion parameters more... But I don’t think that’s going to work. And I don’t think that’s what we need AI to do right now.
In medical image analysis, the key challenge, in my opinion, is not in designing bigger, more complex networks, but in creating better datasets, clinically relevant tasks, and evaluation methods, in the absence of a clear, definitive ground truth (for more on that, see my PhD thesis [5]!). Other fields will have their own, sometimes similar, challenges. Challenges that won’t be solved by blindly applying huge models, but by actually focusing on the practical applications of those models outside of controlled benchmarks.