If one looks at the approximate results from various deep learning tasks, from image classification to speech recognition and beyond, the standby quote from Bayesian statistics that “probability is orderly opinion, and that inference from data is nothing other than the revision of such opinion in the light of relevant new information” takes on added weight.
This is not to say that the two fields are fundamentally interlinked, but conceptually, just as the decades-old ideas of Bayesian statistics are finally coming back into the mainstream, so too are deep learning algorithms, many of which were developed years ago but lacked the infrastructure to find a fit in practical reality. And in an interesting juxtaposition, statistics, the science of exact figures, is finding ever more in common with the evolving art of approximate computing.
Speaking from his office at Columbia University, where he is focused on statistical models in postdoctoral work following a PhD in computer science with an emphasis on cognitive science and computational linguistics, Bob Carpenter reflects on the state of both Bayesian statistics and deep learning, the latter a term he tells The Next Platform is confusing in many respects because it does not clearly define the narrow scope of models it actually captures. Still, he says, drawing on his years at Bell Labs as a software engineer and research scientist developing natural language processing and speech recognition tools, it is definitely time for both Bayesian statistics and deep learning to emerge, even if that emergence comes at least a couple of decades late.
To bring more machine learning capabilities to the fore with an abstraction layer that aids in the creation of complex statistical models, Carpenter is putting his nearly one million dollars in National Science Foundation grants into work on the open source “Stan” package, which presents a scalable approach to Bayesian modeling that can incorporate larger, more complex problems than tend to fit inside other frameworks.
Stan is a follow-on to the BUGS language, which is among the dominant languages for expressing Bayesian models. Like so many other tools developed well before the 2000s, it was ahead of its time, Carpenter says, but it has gone through some interesting evolutions. Stan is not trying to compete with Julia, R, or Python, since it is designed for a specific class of problems; instead, there are interfaces for these languages (and others, including MATLAB) so users can write statistical models using the Stan framework. In essence, it is a library for statistical and mathematical computing that automatically takes derivatives of very complex functions and then draws statistical inferences from them.
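To give a sense of what this looks like in practice, below is a minimal sketch that expresses a toy Bernoulli model in the Stan language and fits it from Python through the CmdStanPy interface. The model, data, and parameter names are illustrative assumptions, not anything from Carpenter’s own projects, and the example assumes CmdStanPy and the CmdStan toolchain are installed.

```python
# A minimal sketch: a toy Bernoulli model written in the Stan language and
# fit from Python via CmdStanPy. Assumes `pip install cmdstanpy` plus an
# installed CmdStan toolchain; the model and data are made up for illustration.
import tempfile
from pathlib import Path

from cmdstanpy import CmdStanModel

# theta is the unknown success probability of a coin.
stan_code = """
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);      // uniform prior on theta
  y ~ bernoulli(theta);    // likelihood for the observed flips
}
"""

# CmdStanPy compiles a .stan file, so write the model code out first.
stan_file = Path(tempfile.mkdtemp()) / "bernoulli.stan"
stan_file.write_text(stan_code)
model = CmdStanModel(stan_file=str(stan_file))

# Ten coin flips with two successes.
data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}

# Run Hamiltonian Monte Carlo and print a posterior summary for theta.
fit = model.sample(data=data, chains=4, iter_sampling=1000)
print(fit.summary())
```

The division of labor is the point Carpenter makes above: the statistical model lives in the Stan program, while the host language (here Python) only supplies data and collects the posterior draws.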
But with so many languages and approaches, where does a domain-specific language like Stan fit for writing Bayesian statistical models? And since these models have been around for so many years, why is it only now that this approach is gathering steam?
“All of these things have been around since the 1980s, and for a lot of the natural language processing and other things that companies like Google and Microsoft are improving upon, the late 1970s, especially for speech recognition,” Carpenter says. What has changed is the computational capability, and that is finally bringing a lot of what Carpenter argues was ahead of its time to the forefront.
“If you think about what Google provides, for example, speech recognition on a small phone? That kind of thing, of course, didn’t exist in the 1980s. The algorithms were there but the computing power was not. We are now finally seeing the realization of a lot of algorithms and ideas that are decades old, brought to life by that computing power.”
But of course, it is not just about having access to an ever-growing number of cores. Carpenter marvels at the massive data collection efforts that have pushed the models and languages to evolve to keep pace. From large-scale genome-wide association studies to massive satellite data collection and analysis to the structured and unstructured analysis of social media data, these are all critical datasets to continue expanding now that the horsepower to build the networks is in place.
To be fair, Carpenter says that while Bayesian statistics will continue to play an important role in the ever-broadening class of “deep learning” problems, he thinks machine learning and deep learning still address a relatively narrow class of problems. But what a lot of the attention and funding (for example, Google’s billion-dollar acquisition of DeepMind) has meant for the field is that the promise of those early days of algorithmic development is finally able to come full circle. For instance, he points to work being done at Microsoft, where natural language processing accuracy rates have been pushed from 80% to 87%. That sounds incremental, but these are in fact huge leaps forward, not just for NLP but for other machine learning segments where similar approaches can be extended.
So where will bolstered statistical approaches mesh with the next generation of deep learning and machine learning algorithms to actually take advantage of the new wave of computational capability? Carpenter says the answer is not cut and dried. “The problem with a lot of the machine learning algorithms is that they are intrinsically serial, so they can’t just be easily parallelized. The contrast would be Google’s correction, for example, where they can just count the occurrences of words on the web by breaking it up into millions of pieces and doing MapReduce on it. Most of the problems, like speech recognition, do not have that property where it’s possible to fit all of those pieces, because everything interacts.”
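To make that contrast concrete, here is a minimal, single-machine sketch of the embarrassingly parallel word-count pattern Carpenter alludes to: each shard of text is mapped to per-word counts with no communication between workers, and a reduce step merges the results. The shard text and worker count are invented for illustration; this shows the pattern, not Google’s system.

```python
# A toy illustration of the MapReduce-style word count: shards are processed
# independently (map), then merged (reduce). Shard contents are made up.
from collections import Counter
from multiprocessing import Pool


def map_counts(document):
    """Map step: count words within one shard, with no communication."""
    return Counter(document.lower().split())


def reduce_counts(partials):
    """Reduce step: merge the per-shard counts by summing."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total


if __name__ == "__main__":
    shards = [
        "probability is orderly opinion",
        "inference from data is the revision of such opinion",
        "opinion in the light of relevant new information",
    ]
    # Each shard could live on a separate worker; the pieces never interact,
    # which is exactly the property speech recognition models lack.
    with Pool(processes=3) as pool:
        partial_counts = pool.map(map_counts, shards)
    print(reduce_counts(partial_counts).most_common(5))
```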
This makes more sophisticated modeling approaches, including via Stan, more valuable going forward, he says, in that they create the conditions for an inherently scalable model from the outset. “It is still requiring a heroic effort to get these to scale, and it will still take another heroic effort in terms of more algorithm development to move ahead.” As is the case in many other areas, including HPC, the goal is to efficiently parallelize these types of problems and actually take advantage of current hardware (including accelerators like GPUs). Parallelizing the deep Bayesian statistical models across different machines is the focus of current research by Carpenter and his team at Columbia.
“When we want to do something like split the data into multiple pieces and fit each piece of data on a different machine, there is nothing off the shelf to do that in terms of either the algorithms or the implementations. We are trying to understand what happens to your estimation procedure when each worker only sees a little piece of it, and of course, figure out all the engineering to make this real on actual hardware. There are a lot of people working on this, as evidenced by the more recent NIPS machine learning conferences.”
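There is, as Carpenter says, nothing off the shelf for this, but the shape of the question can be shown with a toy example. The sketch below assumes a conjugate normal-mean model so each shard’s subposterior has a closed form, and it combines the shards by precision weighting, roughly in the spirit of consensus Monte Carlo. The model, shard count, and combination rule are assumptions for illustration only, not the Columbia team’s method.

```python
# A toy sketch of the data-splitting question: each "worker" sees one shard,
# produces its own posterior, and the subposteriors are merged afterward.
# Conjugate normal-mean model with known sigma, so everything is closed form.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 2.0, 1.0
data = rng.normal(true_mu, sigma, size=10_000)


def shard_posterior(y, sigma, prior_mu=0.0, prior_sd=10.0):
    """Posterior mean and sd for a normal mean with known sigma on one shard."""
    prior_prec = 1.0 / prior_sd**2
    like_prec = len(y) / sigma**2
    post_prec = prior_prec + like_prec
    post_mean = (prior_prec * prior_mu + like_prec * y.mean()) / post_prec
    return post_mean, 1.0 / np.sqrt(post_prec)


# Each worker fits its own shard independently, seeing only "a little piece."
shards = np.array_split(data, 8)
sub = [shard_posterior(s, sigma) for s in shards]

# Combine subposteriors by precision weighting. Note each shard used the full
# prior, so the prior is over-counted -- one of the issues that keeps this an
# open research problem rather than an off-the-shelf step.
precisions = np.array([1.0 / sd**2 for _, sd in sub])
means = np.array([m for m, _ in sub])
combined_mean = (precisions * means).sum() / precisions.sum()
combined_sd = 1.0 / np.sqrt(precisions.sum())

full_mean, full_sd = shard_posterior(data, sigma)
print(f"combined shards: {combined_mean:.3f} +/- {combined_sd:.3f}")
print(f"full data:       {full_mean:.3f} +/- {full_sd:.3f}")
```

In this conjugate toy case the merged estimate lands close to the full-data posterior; the hard part Carpenter describes is that realistic models, where “everything interacts,” do not decompose this cleanly.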
All of that said, while the deep image classification and large-scale speech recognition projects at Baidu, Google, and Microsoft tend to get the most attention, what is happening on the ground where deep learning and deep statistical models meet might sound less exciting, but it is critical for the next phase of deep learning and Bayesian model growth. And it takes the kind of work on both the hardware and software development fronts to bring more robust statistical modeling approaches and companion deep learning and machine learning approaches into a simultaneous golden age.
For example, as part of his work at Columbia, Carpenter has been working with drug maker Novartis on a drug-disease interaction clinical trial for an ophthalmology drug that spans close to 2,000 patients. While the models for determining complex interactions across individual patients are not hugely complicated, the reason this hasn’t happened in such detail before is that such models would not fit on traditional systems with traditional approaches. By coupling deep learning approaches with Bayesian statistical models, he says, solving previously intractable problems like this will become commonplace. It is simply a maturity cycle, one that has taken decades but is coming full circle now that the critical engineering, model, and infrastructure pieces are in place.
For those who can read Italian, I would recommend this work.
https://unipd.academia.edu/DPantano/Books
The proposed ideas allow us to understand the profound logic that connects deep learning with emerging rules and Bayesian statistics.