RDMA data transfer and GPU acceleration methods for high-throughput online processing of serial crystallography images

J Synchrotron Radiat. 2020 Sep 1;27(Pt 5):1297-1306. doi: 10.1107/S1600577520008140. Epub 2020 Jul 31.

Abstract

The continual evolution of photon sources and high-performance detectors drives cutting-edge experiments that produce very high-throughput data streams and generate data volumes that are challenging to manage and store. In these cases, efficient data transfer and processing architectures that allow online image correction, data reduction or compression become fundamental. This work investigates different technical options and methods for data placement from the detector head to the processing computing infrastructure, taking into account the particularities of modern modular high-performance detectors. To compare realistic figures, the future ESRF beamline dedicated to macromolecular X-ray crystallography, EBSL8, is taken as an example; it will use a PSI JUNGFRAU 4M detector generating up to 16 GB of data per second and operating continuously for several minutes. Although such an experiment seems possible at the target speed with the 100 Gb s⁻¹ network cards that are currently available, the simulations carried out highlight some potential bottlenecks when using a traditional software stack. An evaluation is presented of solutions that implement remote direct memory access (RDMA) over Converged Ethernet techniques. A synchronization mechanism is proposed between an RDMA network interface card (RNIC) and a graphics processing unit (GPU) accelerator in charge of the online data processing. The placement of the detector images onto the GPU is made to overlap with the computation, potentially hiding the transfer latencies. As a proof of concept, a detector simulator and a backend GPU receiver with a rejection and compression algorithm suitable for a synchrotron serial crystallography (SSX) experiment are developed. It is concluded that the available transfer throughput from the RNIC to the GPU accelerator is at present the major bottleneck in online processing for SSX experiments.
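The figures quoted above can be put into perspective with a little arithmetic. The following Python sketch is illustrative only and is not taken from the paper: the function names are hypothetical, the 16 GB per second is read as decimal gigabytes, protocol overhead is ignored, and the 5 minute run length is an assumed stand-in for "several minutes". It estimates how many 100 Gb s⁻¹ links the raw stream would need, how much data one run produces, and how overlapping transfer with computation changes the time spent per image.

```python
import math

# Figures taken from the abstract.
DETECTOR_RATE_GB_S = 16.0   # up to 16 GB of data per second
LINK_RATE_GBIT_S = 100.0    # line rate of one 100 Gb/s network card


def required_links(rate_gbyte_s: float, link_gbit_s: float) -> int:
    """Minimum number of links whose raw line rate covers the stream.

    Assumption: decimal units and no protocol overhead.
    """
    return math.ceil(rate_gbyte_s * 8.0 / link_gbit_s)


def run_volume_tb(rate_gbyte_s: float, minutes: float) -> float:
    """Data volume in TB produced by a continuous run of the given length."""
    return rate_gbyte_s * minutes * 60.0 / 1000.0


def time_per_image(transfer_s: float, compute_s: float, overlapped: bool) -> float:
    """Steady-state time per image in a simple two-stage pipeline model.

    With overlap, the slower stage sets the pace; without it, the two
    stages run back to back and their times add up.
    """
    if overlapped:
        return max(transfer_s, compute_s)
    return transfer_s + compute_s


if __name__ == "__main__":
    print("links needed:", required_links(DETECTOR_RATE_GB_S, LINK_RATE_GBIT_S))
    # Assumed run length of 5 minutes (hypothetical value).
    print("volume per run (TB):", run_volume_tb(DETECTOR_RATE_GB_S, 5.0))
    # Hypothetical per-image stage times in milliseconds.
    print("serial (ms):", time_per_image(0.4, 0.5, overlapped=False))
    print("overlapped (ms):", time_per_image(0.4, 0.5, overlapped=True))
```

Under these assumptions, 16 GB per second corresponds to 128 Gb s⁻¹, so the raw stream would have to be spread over at least two 100 Gb s⁻¹ links. The pipeline model also shows why overlap hides transfer latency only while the transfer stage is not the slower one, which is consistent with the conclusion that RNIC-to-GPU throughput is the limiting factor.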

Keywords: GPU; RDMA; RoCEv2; SSX; online data analysis; online data processing.