Vision transformers: The next frontier for deep learning-based ophthalmic image analysis

Jo-Hsuan Wu; Neslihan D Koseoglu; Craig Jones; T Y Alvin Liu

doi:10.4103/sjopt.sjopt_91_23

Saudi J Ophthalmol. 2023 Jul 14;37(3):173-178. doi: 10.4103/sjopt.sjopt_91_23. eCollection 2023 Jul-Sep.

Authors

Jo-Hsuan Wu¹, Neslihan D Koseoglu², Craig Jones²,³,⁴, T Y Alvin Liu²

Affiliations

¹ Department of Ophthalmology, Shiley Eye Institute and Viterbi Family, University of California, San Diego, La Jolla, CA, USA.
² Department of Ophthalmology, Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD, USA.
³ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
⁴ Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, USA.

Abstract

Deep learning is the state-of-the-art machine learning technique for ophthalmic image analysis, and convolutional neural networks (CNNs) are the most commonly used approach. Recently, vision transformers (ViTs) have emerged as a promising alternative, one that is even more powerful than CNNs. In this focused review, we summarize studies that applied ViT-based models to the analysis of color fundus photographs and optical coherence tomography images. Overall, ViT-based models showed robust performance in diabetic retinopathy grading and glaucoma detection. While some studies demonstrated that ViTs were superior to CNNs in certain contexts of use, it is unclear how widely ViTs will be adopted for ophthalmic image analysis, since ViTs typically require even more training data than CNNs. The included studies were identified from the PubMed and Google Scholar databases using keywords relevant to this review. Only original investigations published through March 2023 were included.

Keywords: Color fundus photographs; deep learning; ophthalmic image analysis; optical coherence tomography; vision transformers.