We have written much over the last few years about the convergence of deep learning and traditional supercomputing but as the two grow together, it is clear that the tools for one area don’t always mesh well with those of the other.
For instance, earlier this week we looked at the major limitations of traditional HPC file systems to keep up with the training density required for deep learning supercomputing jobs at scale. As we continue to analyze where other challenges lie we found that the toolchain for getting HPC simulation data to speak TensorFlow (and most other frameworks) is lacking as well, presenting an opportunity for developer teams to find a new way to translate classical simulation data more efficiently.
This bottleneck came up during a recent chat we had with Cray engineers about scaling training for CosmoFlow on the “Cori” CrayXC40 supercomputer. This workload used cosmological simulation data to train a neural network to detect galaxy clusters and make predictions about the formation of the universe.
More specifically, the team used fast simulators (lower accuracy simulations) to train the network for the level of accuracy needed. Thousands of these were run to produce the training set, all with slightly different input parameters. The final training set was over a terabyte that had to be converted into a format the TensorFlow-based network could actually use.
According to Peter Mendygral, a software engineer at Cray who worked with the NERSC and Berkeley Lab teams on CosmoFlow, even if in the case of this particular application, the translation between simulation data and neural network training frameworks was not difficult, it was computationally and I/O expensive to do.
“The simulations put out data relatively close to what is needed but there is reformatting that needs to be done for TensorFlow specifically. This is I/O intensive and it did take a while. There were some custom scripts needed for this. The hard part was training the actual network but this is still a step that relies now on a lot of custom scripting work for scientists.”
“In most cases, the problem is solved in an ad hoc way. There are some cases I’ve seen where people have applied actual tools like Apache Spark and some libraries to do the data transformation but it’s not really a generalizable tool, just a framework the can use, says Mendygral.
Since the CosmoFlow example is similar to other areas in HPC that are adopting deep learning (namely, using fast simulations to produce training sets that are fed into TensorFlow or other frameworks), there is a need for a tool or framework that can take over the task of transforming simulation data into a format that neural networks can use.
TensorFlow Transform is one such framework-specific tool for taking raw data from simulations and putting it into TensorFlow graphs but again, for terabyte-sized training sets, this comes with some data movement and processing costs. It could be useful for the HPC community to develop its own tools that capture the various features of the data and labels they want from simulations and is designed from the ground up to run efficiently on larger datasets.
Using the Cray XC40 supercomputer did ease some of the simulation to network results. As the company’s Paul Hahn says, the team was able to perform fully synchronous data-parallel training on 8,192 nodes of Cori, “achieving 3.5 Pflop/s sustained performance.” He adds, “Without the power of a supercomputer, this work simply could not be performed. Most deep learning exploration today is performed on small, single-node systems. In the authors’ estimate you’d need “more than 60 days of execution time on a single node to converge to a model at an acceptable loss.” The CosmoFlow run using 8,192 nodes “took roughly 9 minutes total with 8 minutes of training time.”
Cray adds that their PE Machine Learning Plugin, a part of the Cray Urika-XC Analytics and AI suite, improves the scalability and performance of TensorFlow distributed training. The plugin replaces the TensorFlow socket-based gRPC communications and the associated parameter server /worker architecture with an MPI-optimized communication mechanism, and implements speed-up algorithms for synchronous stochastic gradient descent training. Using the Cray PE ML Plugin, the team was able to use 8,192 nodes to do fully synchronized data-parallel training, where previous efforts on Cori had encountered significant scaling issues.