Unplanned extubation (UEX) by patients is dangerous in the intensive care unit (ICU), so early warning of UEX is necessary. However, the subtle, fine-grained nature of UEX actions and the complexity of the ICU environment make early warning from RGB video data highly challenging. To address this issue, we propose a novel lightweight Spatial-Temporal Transformer (STformer) for early warning of patients' UEX actions in the ICU. Specifically, SlowFast is first used to extract the patient's spatial-temporal features. Then, to improve the feature representation, we introduce spatial attention to enhance the spatial representation of fine-grained actions, and temporal attention to capture the long-term dependencies of motions. Finally, spatial-temporal joint attention is used to reconstruct and strengthen the spatial and temporal information. Experimental results demonstrate the state-of-the-art performance of STformer on ICU monitoring datasets. STformer also keeps computational complexity low while maintaining early-warning accuracy.
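To make the three attention stages concrete, the following is a minimal illustrative sketch, not the STformer implementation. It assumes the backbone features are arranged as T frames of N spatial tokens with C channels each, uses identity query/key/value projections in place of learned weights, and applies the stages sequentially; the names `self_attention`, `spatial_attention`, `temporal_attention`, `joint_attention`, and `stformer_block` are all hypothetical, and the ordering and single-head form are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of C-dim tokens.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def spatial_attention(feats):
    """Attend across the N spatial tokens within each frame."""
    return [self_attention(frame) for frame in feats]


def temporal_attention(feats):
    """Attend across the T frames at each spatial position."""
    num_frames, num_tokens = len(feats), len(feats[0])
    out = [[None] * num_tokens for _ in range(num_frames)]
    for n in range(num_tokens):
        track = self_attention([feats[t][n] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][n] = track[t]
    return out


def joint_attention(feats):
    """Attend across all T*N tokens at once, then restore the (T, N, C) layout."""
    num_tokens = len(feats[0])
    flat = self_attention([tok for frame in feats for tok in frame])
    return [flat[i:i + num_tokens] for i in range(0, len(flat), num_tokens)]


def stformer_block(feats):
    """Hypothetical composition: spatial, then temporal, then joint attention."""
    return joint_attention(temporal_attention(spatial_attention(feats)))


# Toy stand-in for backbone features: T=4 frames, N=3 tokens, C=2 channels.
toy = [[[float(t + n + c) for c in range(2)] for n in range(3)] for t in range(4)]
refined = stformer_block(toy)
print(len(refined), len(refined[0]), len(refined[0][0]))
```

The factorized spatial and temporal stages each attend over only N or T tokens, whereas the joint stage attends over all T*N tokens; in this sketch every stage preserves the feature layout, so the stages can be stacked or reordered freely.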