To directly relate tissue abnormalities to dysfunctional voicing, it is essential to temporally resolve vocal fold movement during phonation at the microscopic level. High-speed video (HSV) can record the vocal folds at 2,000-4,000 frames per second. Ultra-high-resolution optical coherence tomography can distinguish cellular layers with a resolution better than 5 μm down to a tissue depth of 1 mm. In this review, we propose combining the two technologies and applying deep learning-based image segmentation to establish statistically sound and reproducible documentation of voice-related diseases.