Recurrent neural network FPGA hardware accelerator for delay-tolerant indoor optical wireless communications

Opt Express. 2021 Aug 2;29(16):26165-26182. doi: 10.1364/OE.427250.

Abstract

The optical wireless communication (OWC) system has been widely studied as a promising solution for high-speed indoor applications. The transmitter diversity scheme has been proposed to improve the performance of high-speed OWC systems; however, transmitter diversity is vulnerable to delays among the multiple channels. Recently, neural networks have been studied to realize delay-tolerant indoor OWC systems, where long short-term memory (LSTM) and attention-augmented LSTM (ALSTM) recurrent neural networks (RNNs) have shown their capability. However, they suffer from high computational complexity and long computation latency. In this paper, we propose a low-complexity delay-tolerant RNN scheme for indoor OWC systems. In particular, an RNN with a parallelized structure is proposed to reduce the computation cost. The proposed RNN scheme shows capability comparable to that of the more complicated ALSTM, achieving a bit-error-rate (BER) performance within the forward-error-correction (FEC) limit for delays of up to 5.5 symbol periods. In addition, previously studied LSTM/ALSTM schemes were implemented on high-end GPUs, which have high cost, high power consumption, and long processing latency. To address these practical limitations, in this paper we further propose and demonstrate an FPGA-based RNN hardware accelerator for delay-tolerant indoor OWC systems. To optimize the processing latency and power consumption, we also propose two optimization methods: parallel implementation with triple-phase clocking and stream-in-based computation with additive input data insertion. Results show that the FPGA-based RNN hardware accelerator with the proposed optimization methods achieves a 96.75% reduction in effective latency and 90.7% lower energy consumption per symbol compared with the FPGA-based RNN hardware accelerator without optimization. Compared with the GPU implementation, the latency is reduced by about 61% and the power consumption by about 58.1%.
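As a rough illustration of why a simple (non-gated) RNN cell has lower computational complexity than an LSTM cell, the sketch below compares their per-step operation counts. This is a generic comparison, not the paper's actual architecture; the layer sizes and all function names are hypothetical.

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # Vanilla RNN cell: one input/recurrent weight pair per step,
    # so roughly (n_in * n_h + n_h * n_h) multiplies per time step.
    return np.tanh(x @ Wx + h @ Wh + b)

def lstm_step(x, h, c, Wx, Wh, b):
    # LSTM cell: four gate blocks (input, forget, cell, output),
    # so roughly 4x the matrix multiplies of the vanilla RNN cell.
    n_h = h.shape[-1]
    z = x @ Wx + h @ Wh + b          # shape (4 * n_h,)
    i, f, g, o = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sig(i), sig(f), sig(o)
    c_new = f * c + i * np.tanh(g)
    return np.tanh(c_new) * o, c_new

# Hypothetical sizes for counting multiply operations per step
n_in, n_h = 8, 16
rnn_mults = n_in * n_h + n_h * n_h
lstm_mults = 4 * (n_in * n_h + n_h * n_h)
print(rnn_mults, lstm_mults)  # the LSTM needs ~4x the multiplies
```

The ~4x gap in per-step matrix multiplies (plus the LSTM's extra element-wise gating) is the kind of cost the abstract's low-complexity RNN avoids; the paper's additional parallelization and FPGA optimizations go beyond this simple cell-level comparison.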