Silent but deadly: nothing is more destructive than data corruption that slips past the error-capture tools in hardware and even in software, and that can be hard to spot before it has infected an entire application.
This is especially devastating at Facebook scale, but engineering teams at the social giant have discovered strategies to keep a local problem from going global. A single hardware-rooted error can cascade into a massive problem when multiplied at hyperscale, and for Facebook, keeping this at bay takes a combination of hardware resiliency, production detection mechanisms, and a broader fault-tolerant software architecture.
Facebook’s infrastructure team started an effort in 2018 to understand the roots of silent data corruption, how fleet-wide fixes might look, and what those detection strategies could cost in terms of overhead.
Engineers found that many of the cascading errors originate in production CPUs, but not always because of the “soft errors” associated with radiation or synthetic fault injection. Rather, they find that these corruptions can strike CPUs at random yet recur in repeatable ways. Although ECC is useful, it is focused on problems in SRAM, and other elements of the chip remain susceptible. The Facebook engineering team that reported on these problems finds that CPU silent data corruptions are actually orders of magnitude more frequent than soft errors because other blocks lack error correction.
Increased CPU complexity opens the door to more errors, and when compounded at hyperscale datacenter levels with ever-denser nodes, these at-scale problems will only become more widespread. At the hardware level, the problems range from general device errors (placement and routing problems can lead to different arrival times for signals, causing bit-flips, for instance) to more manufacturing-centric problems such as etching errors, which still happen. Further, early life failures of devices and degradation of existing CPUs can also have hard-to-detect impacts.
For example, when you perform 2×3, the CPU may give a result of 5 instead of 6 silently under certain microarchitectural conditions without any indication of the miscomputation in the system event or error logs. As a result, a service utilizing the CPU is potentially unaware of the computational accuracy and keeps consuming the incorrect values in the application.
“Silent data corruptions are real phenomena in datacenter applications running at scale,” members from the Facebook infrastructure team explain. “Understanding these corruptions helps us gain insights into the silicon device characteristics; through intricate instruction flows and their interactions with compilers and software architectures. Multiple strategies of detection and mitigation exist, with each contributing additional cost and complexity into a large-scale datacenter infrastructure.”
Facebook used a few reference applications to highlight the impact of silent data corruption at scale, including a Spark workflow that runs millions of wordcount computations per day and FB’s compression application, which similarly performs millions of compression/decompression computations daily. In the compression example, Facebook observed a case where the algorithm returned a size value of “0” for a single file (it was supposed to be a non-zero number), so the file was not written into the decompressed output database. “As a result, the database had missing files. The missing files subsequently propagated to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. The chain of dependencies causes the application to fail.” And pretty soon, the querying infrastructure reports back with critical data loss. The problem is clear from this one example. Imagine if it were larger than just compression or wordcount. Facebook can.
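To make the compression failure mode concrete, here is a minimal sketch of a round-trip check that would flag a zero-size or mismatched result before it reaches a downstream store. It uses Python's standard `zlib` module purely for illustration. The names `compress_verified` and `CorruptionSuspected` are hypothetical, and nothing here is taken from Facebook's actual compression application.

```python
import zlib


class CorruptionSuspected(Exception):
    """Raised when a compressed blob fails the round-trip check (hypothetical name)."""


def compress_verified(payload: bytes) -> bytes:
    """Compress payload and confirm it decompresses back to the same bytes."""
    blob = zlib.compress(payload)
    # A zlib stream always carries a header and checksum, so even empty
    # input produces a non-empty blob. A zero-length result is never valid.
    if len(blob) == 0:
        raise CorruptionSuspected("compressor returned an empty result")
    # Decompress immediately and compare against the original input.
    if zlib.decompress(blob) != payload:
        raise CorruptionSuspected("round trip does not match the input")
    return blob
```

The check roughly doubles the work per file, which is the kind of detection overhead the article says has to be weighed at fleet scale.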
Data corruptions propagate across the stack and manifest as application level problems. These types of errors can result in data loss and can require months of debug engineering time… With increased silicon density and technology scaling, we believe that academic researchers and industry should invest in methods to counter these issues.
Debugging is arduous but it is still at the heart of how Facebook handles these silent data corruptions, although not until they’re loud enough to be heard. “To debug a silent error, we cannot proceed forward without understanding which machine level instructions are executed. We either need an ahead-of-time compiler for Java and Scala or we need a probe, which upon execution of the JIT code, provides the list of instructions executed.” Their best practices for silent error debugging are detailed in section 5.2.
An overall suite of fault tolerance mechanisms is also key to Facebook’s strategy. These include redundancy at the software level but, of course, this comes with costs. “The cost of redundancy has a direct effect on resources; the more redundant the architecture, the larger the duplicate resource pool requirements,” even though this is the most certain path to probabilistic fault tolerance. Ways of dealing with fault tolerance that carry less overhead include relying on fault-tolerant libraries (PyTorch is specifically cited), although this is not “free” either: the impact on application performance is palpable.
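As a rough illustration of software-level redundancy and why its cost scales with the number of replicas, the sketch below runs the same computation several times and accepts only a strict-majority result. The `vote` helper is hypothetical. In a real deployment each replica would have to run on a different core or machine, since a defective core may repeat the same wrong answer. Here the replicas are plain callables so the example stays self-contained.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every replica and return the result that a strict majority agrees on."""
    results = [replica() for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority: surface the disagreement instead of guessing.
        raise RuntimeError(f"no majority among results: {results!r}")
    return value


def healthy() -> int:
    return 2 * 3


def faulty() -> int:
    return 5  # simulated silent corruption of 2 * 3


# Two healthy replicas outvote one that miscomputes.
print(vote([healthy, faulty, healthy]))
```

Three replicas tolerate one bad result at roughly three times the compute, which is the resource trade-off the quote above describes.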
“This effort would need a close handshake between the hardware silent error research community and the software library community.”
In terms of that handshake, Facebook is openly calling on datacenter device makers to understand that their largest customers are expecting more, especially given the cascading wide-net impacts of hardware-derived errors.
“Silent data corruptions are not limited to rare one in a million occurrences within a large-scale infrastructure. These errors are systemic and are not as well understood as the other failure modes like Machine Check Exceptions.” The infrastructure team adds that several studies have evaluated techniques for reducing the soft error rate within processors, and that those lessons can be carried over to similar, repeatable SDCs, which can occur at a higher rate.
A large part of the responsibility should be shared by device makers, Facebook says. These approaches are on the manufacturer’s side and can include beefing up the blocks on a device for better datapath protection using custom ECCs, providing better randomized testing, understanding that increased density means higher propagation of errors and, most important, understanding “at scale behavior” via “close partnership with customers using devices at scale to understand the impact of silent errors.” This would include occurrence rates, time to failure in production, dependency on frequency, and environmental issues that impact these errors.
“Facebook infrastructure has implemented multiple variants of the above hardware detection and software fault tolerant techniques in the past 18 months. Quantification of benefits and costs for each of the methods described above has helped the infrastructure to be reliable for the Facebook family of apps.” The infrastructure team plans to release a follow-on with more detail about the various trade-offs and costs for their current approaches.
More detail, including Facebook’s best practices for fault tolerance in software and architecting around potential hardware failures can be found here.
OK, Facebook reinventing the wheel.
Nothing new. Google did the same analysis 10 years ago on RAM corruption and concluded that the main cause was bad board-level hardware engineering of the memory subsystem (noise on the memory bus, ground bounce in random corner cases, and missing or misplaced power supply capacitors) and not SEU, at least at sea level in computer centers.
To avoid miscalculation, either systematic or random, multiple well-known techniques exist: mainly redundancy (with diversification, either in hardware using 2-out-of-2 techniques, or by performing the same operation using algorithmic diversification), or coherence checks with retry or fallback values. In real-time applications, this is very common to ensure global robustness against random short-term failures of interfaces.
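The techniques listed in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: `shift_add_multiply` and `checked_multiply` are hypothetical names, the operands are assumed to be non-negative integers, and in Python both paths ultimately run on the interpreter's integer arithmetic, so the diversification is at the algorithm level rather than the instruction level.

```python
from typing import Callable


def shift_add_multiply(a: int, b: int) -> int:
    """Multiply non-negative integers using only shifts and additions."""
    product = 0
    while b:
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return product


def checked_multiply(
    a: int,
    b: int,
    fallback: int,
    retries: int = 2,
    primary: Callable[[int, int], int] = lambda x, y: x * y,
) -> int:
    """Compute a * b two different ways; retry on mismatch, then fall back."""
    for _ in range(retries + 1):
        first = primary(a, b)              # primary path (native multiply)
        second = shift_add_multiply(a, b)  # algorithmically diverse path
        if first == second:                # coherence check
            return first
    return fallback                        # the two paths never agreed


print(checked_multiply(2, 3, fallback=-1))
```

Passing a deliberately wrong `primary`, for example `lambda x, y: 5`, exercises the retry and fallback path.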