OBET: On-the-Fly Byte-Level Error Tracking for Correcting and Detecting Faults in Unreliable DRAM Systems

Sensors (Basel). 2021 Dec 10;21(24):8271. doi: 10.3390/s21248271.

Abstract

With technology scaling, maintaining the reliability of dynamic random-access memory (DRAM) has become more challenging. On-die error correction codes (ECC) have therefore been introduced in DDR5 to address reliability issues. However, the current solution still suffers from high overhead when a large DRAM capacity is used to deliver high performance. To address this problem, we present a DRAM chip architecture that tracks DRAM cell errors at byte-level granularity. The proposed architecture classifies DRAM faults as temporary or permanent, and it requires no additional pins and only minor DRAM chip modifications. Hence, we achieve reliability comparable to that of other state-of-the-art solutions while incurring negligible performance and energy overhead. Furthermore, the faulty locations are efficiently exposed to the operating system (OS). Thus, we can significantly reduce the required scrubbing cycle by scrubbing only faulty DRAM pages, while reducing the system failure probability by a factor of up to 5000 to 7000 relative to conventional operation.
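
The idea of scrubbing only faulty pages can be illustrated with a minimal Python sketch. It is not the OBET hardware mechanism. It only shows how byte-level error addresses, once exposed to the OS, could be mapped to the set of pages that need scrubbing. The function names, the 4 KiB page size, and the repeat-count rule for labeling a fault "permanent" are all assumptions made for illustration and do not come from the paper.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed OS page size in bytes (not specified by the abstract)


def classify_faults(error_events, threshold=2):
    """Label each erroneous byte address as 'temporary' or 'permanent'.

    Hypothetical rule: an address reported `threshold` or more times is
    treated as permanent. The paper's actual classification may differ.
    """
    counts = Counter(error_events)
    return {
        addr: ("permanent" if n >= threshold else "temporary")
        for addr, n in counts.items()
    }


def pages_to_scrub(error_events, page_size=PAGE_SIZE):
    """Map byte-level error addresses to the sorted list of affected pages."""
    return sorted({addr // page_size for addr in error_events})


def scrub_fraction(error_events, total_pages, page_size=PAGE_SIZE):
    """Fraction of all pages that selective scrubbing would have to visit."""
    return len(pages_to_scrub(error_events, page_size)) / total_pages


# Example: five reported errors, one address reported twice.
events = [100, 5000, 5000, 8191, 20000]
labels = classify_faults(events)
pages = pages_to_scrub(events)
```

In this toy example only three pages out of the whole memory would be scrubbed, which is the source of the reduced scrubbing cost. A full scrub would visit every page regardless of where errors occurred.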

Keywords: DDR5; DRAM chips; availability; debugging; error correction codes; failure analysis; fault diagnosis; memory architecture; memory management; on-die ECC; semiconductor device reliability; semiconductor device testing.