Memory Snapshots Bring Checkpointing Into The 21st Century
When you have a massively distributed computing job that can take months to run across thousands to hundreds of thousands of compute elements, a single hardware or software crash can mean losing an enormous amount of work. …