Each year, Intel bundles its core software developments into an updated release of its Parallel Studio Suite, which includes extensions to its libraries, compilers, and analysis tools. Each release tends to add new features or carry a key announcement, and the Parallel Studio Suite 2016 release keeps with that trend by targeting the needs of data analytics.
Wrapped into this morning’s update is the new Intel Data Analytics Acceleration Library (DAAL), a toolset for the big data crowd built on the Intel Math Kernel Library (MKL), which sits at the heart of many high performance computing workloads and brings a performance emphasis honed over years in that field. DAAL layers a long list of common machine learning and data analysis algorithms over that MKL base and adds the ability to hook into platforms like Hadoop and Spark.
This marks another extension of MKL, which has grown significantly since it emerged in 2003 and became the standard for high performance computing in engineering, financial, and other application segments that run on Intel processors. Beyond the DAAL add-on, MKL itself is set for some additions to its core with the 11.3 release, including new tooling for the visual effects industry, batch GEMM functions, and more MPI wrappers.
According to Intel’s parallel computing guru, James Reinders, the MKL team, which also developed DAAL, had to find new ways of thinking about data-intensive versus compute-bound problems, but some of the basic issues of data movement and handling are shared in both the data analytics and HPC worlds. “We have experience with these types of problems from working with solvers in the HPC domain that are out of core. The need is to bring in the data as seldom as possible and get as much out of it as possible before discarding it and bringing in more data. We apply that same technique in DAAL for data analytics so that no matter what the function or algorithm is, DAAL combines the data handling intelligence with the computational algorithms to manage both the data handling and the number crunching.”
“Some people tend to use MKL in big data situations now, but they tend to be users that are coming from an HPC world and are moving into big data analytics. With DAAL we’re hoping to speak the language of big data but with an HPC core.”
Further, Reinders says the HPC roots of MKL put Intel ahead of the curve with DAAL, because the MKL team has had to do a lot of work for HPC users who wanted to minimize power consumption. Since a great deal of the power use in HPC systems comes from moving data, the team based many of its strategies around that fact. The challenge, and the part that was new, is that supercomputing centers collect, store, and address data on the assumption that it needs to be ready for computation. For the large-scale data analytics users Intel is targeting with DAAL, however, the data arrives in many formats and from many different streams, and often it was not collected or prepared with complex algorithms or computations in mind. It then becomes a data science challenge not just to find the data and prepare it for computation, but to make sure the number crunching is handled efficiently.
As seen below, the new analytics library works with several tools and platforms. Reinders says that in the interest of keeping the user base wide, Intel is supporting not just its own Hadoop distribution but those of all the main distro providers as well. The library also has hooks to interface with Spark, which is important, but in terms of data sources, SQL (as Diane Bryant, Intel’s datacenter lead, noted at IDF in connection with Intel’s open source Streaming SQL project) and NoSQL are key.
“Some big data problems are done in memory, many are out of core where the data won’t all fit in memory, so it is being streamed or handled in blocks. All of this is dependent on the mathematical algorithm being applied. So by taking what is known about your math algorithm and combining that with the data handling in DAAL, it is possible to be more efficient with what data you fetch, how long it is held in memory, and how it gets moved out.”
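To make the block-at-a-time idea concrete, here is a minimal sketch in plain Python. It is not DAAL's API; every function name below is hypothetical and chosen only for illustration. It assumes the algorithm is a mean and variance calculation, which lets each block be reduced to a small partial summary and then discarded, so each value is read exactly once and only one block is held in memory at a time.

```python
from typing import Iterable, Iterator, List, Tuple

# Partial summary of the data seen so far:
# (count, mean, sum of squared deviations from the mean).
Partial = Tuple[int, float, float]


def summarize_block(block: List[float]) -> Partial:
    """Reduce one in-memory block to a small partial summary."""
    n = len(block)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(block) / n
    m2 = sum((x - mean) ** 2 for x in block)
    return (n, mean, m2)


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def read_blocks(values: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Stand-in for a data source that yields fixed-size blocks."""
    block: List[float] = []
    for value in values:
        block.append(value)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def streaming_mean_variance(values: Iterable[float], block_size: int = 4) -> Tuple[float, float]:
    """Mean and population variance, holding one block in memory at a time."""
    state: Partial = (0, 0.0, 0.0)
    for block in read_blocks(values, block_size):
        # The block is dropped once its summary is folded into the state.
        state = merge(state, summarize_block(block))
    n, mean, m2 = state
    variance = m2 / n if n else 0.0
    return mean, variance
```

Calling `streaming_mean_variance(range(10), block_size=3)` gives the same result as computing over the full list at once. The point of the sketch is that knowing the math (these statistics can be merged from partial results) is what allows the data handling to fetch each block only once.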
Although Intel is touting a future single system architecture that lets HPC and data analytics work together seamlessly across shared systems and tools, DAAL is still very much a data analytics approach for separate systems. Reinders says that while the shared architecture is indeed an important goal, for now there is not a lot of overlap in terms of actual users running both HPC and “big data” analytics jobs on the same machine. He does say that there are ways to make that happen, and that Intel is hard at work pushing the two areas closer together.
Pricing details for Parallel Studio 2016 with the DAAL capabilities were also released this morning and are pictured below.