🔎 All NSCLC neoadjuvant IO models had high bias risk
A systematic review and meta-analysis of 17 studies covering 44 machine learning models for predicting response to neoadjuvant immunotherapy in resectable non-small cell lung cancer (NSCLC) found that every model carried a high risk of bias, despite moderately strong discrimination (pooled AUC 0.786 on internal validation and 0.760 on external validation). Overall, 89% of models were rated relatively low quality, with common flaws in sample size, missing-data handling, and validation methods.
Why It Matters To Your Practice
These tools may look promising on paper, but weak methodology can inflate performance and limit bedside reliability.
If you are evaluating AI for neoadjuvant decision support, model governance matters as much as AUC.
The review suggests current evidence is not yet strong enough to support uncritical adoption in routine surgical or perioperative oncology workflows.
Clinical Implications
Internal validation showed pooled sensitivity of 0.763 and specificity of 0.908, but these estimates came from a limited subset of models.
External validation performance was lower, with pooled AUC 0.760, underscoring the risk of performance drop-off outside the development setting.
By algorithm, support vector machines (SVM) had the highest reported AUC (0.841); by feature type, non-radiomics models outperformed radiomics models (AUC 0.869 vs. 0.775).
By endpoint, AUC was 0.805 for predicting major pathologic response and 0.761 for predicting pathologic complete response.
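To put the pooled figures above in more familiar bedside terms, the short Python sketch below converts the reported sensitivity and specificity into likelihood ratios and computes the gap between internal and external AUC. It uses only numbers quoted in this summary. The function names are illustrative, and `post_test_probability` takes a pre-test probability that you must supply from your own setting, since the review summary gives no response rate. Because the sensitivity and specificity estimates came from a limited subset of models, all rated at high risk of bias, treat the output as a rough illustration rather than a validated performance claim.

```python
# Pooled estimates as quoted in the review summary above.
SENSITIVITY = 0.763      # internal validation, limited subset of models
SPECIFICITY = 0.908      # internal validation, limited subset of models
AUC_INTERNAL = 0.786
AUC_EXTERNAL = 0.760


def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a binary prediction."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio via odds.

    pre_test_probability is an assumption supplied by the caller
    (for example, a locally observed response rate); it is not
    reported in the review summary.
    """
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


lr_pos, lr_neg = likelihood_ratios(SENSITIVITY, SPECIFICITY)
auc_gap = AUC_INTERNAL - AUC_EXTERNAL

print(f"LR+ = {lr_pos:.2f}")                               # about 8.29
print(f"LR- = {lr_neg:.2f}")                               # about 0.26
print(f"Internal-to-external AUC drop = {auc_gap:.3f}")    # 0.026
```

A positive likelihood ratio near 8 and a negative ratio near 0.26 would normally suggest a useful rule-in tool with weaker rule-out value, but those ratios inherit every bias in the underlying estimates.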
Insights
The most frequent problems were inadequate sample size, improper handling of missing data, and flawed validation procedures.
Radiomics methodological quality was limited: across the 12 radiomics studies, the mean Radiomics Quality Score (RQS) was 14.58 out of a maximum of 36, indicating substantial reporting and design gaps.
Applicability was rated good overall, suggesting the main barrier is not clinical relevance but trustworthiness of model development.
The Bottom Line
AI models for predicting neoadjuvant immunotherapy response in resectable NSCLC show signal, but the evidence base is methodologically fragile.
For clinicians interested in how AI will affect practice, this review is a reminder to ask not just how accurate a model is, but how it was built, validated, and tested across settings.
Until stronger external validation and better study design emerge, these models are best viewed as investigational rather than practice-changing.