Intel's Statistical Computing research team is addressing machine learning, prediction, and decision making under uncertainty using a variety of probabilistic techniques. The techniques fall broadly into two categories: model-based approaches such as Bayesian networks (a method of modeling the causes of current and future events) and data-fitting techniques. The latter involve mining vast databases to identify patterns in data that could not be discovered by any other means. Statistical computing can fully utilize the massive and ever-growing computing power that is now available to extract end-user value, converting masses of data into actionable information.
Statistical computing provides the mathematical foundation for all machine learning research underway at Intel. It is essential to advancing Intel's vision of a proactive computing future, in which computing devices embedded throughout the physical environment will anticipate what humans need and sometimes even take action on their behalf. To make this possible, the devices must not only gather data from the environment but also infer what that data means, using embedded machine learning algorithms, in order to decide what action to take.
The key objectives of the Statistical Computing project are to develop and apply new probabilistic algorithms to a range of high value applications, and to provide more general statistical tools to enable the broader research community to accelerate the development of machine learning applications. Potential applications span a broad range of domains, from silicon manufacturing to drug discovery to supply chain modeling. To prove the value of the technology, we are using several Intel manufacturing applications that are representative of the broader domain.
Applying Statistical Computing in Intel's Fabs
The Statistical Computing team is applying its techniques in several manufacturing applications within Intel. While the focus is on improving Intel's processes, the techniques are general enough to be adopted in a range of manufacturing settings.
Reducing Manufacturing Test Costs
One key application is focused on reducing testing costs in Intel's fabs. Applying Bayesian statistics, researchers are attempting to predict which segments of a given lot of silicon wafers are most likely to contain defective dies. These areas of the wafer become candidates for exhaustive testing, and the rest of the wafer can undergo selective testing without adversely affecting product quality. The goal is to eliminate redundant tests that increase costs without adding value.
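As a rough illustration of the idea, the sketch below estimates a defect rate for each wafer region with a Beta-Binomial model and flags regions for exhaustive or selective testing. The region names, counts, uniform prior, and 5% threshold are all hypothetical; the text does not describe the model Intel's researchers actually use.

```python
from dataclasses import dataclass


@dataclass
class RegionHistory:
    """Historical test results for one wafer region (hypothetical data)."""
    name: str
    dies_tested: int
    dies_defective: int


def posterior_defect_rate(history, prior_a=1.0, prior_b=1.0):
    """Posterior mean defect rate under an assumed Beta(prior_a, prior_b) prior."""
    return (history.dies_defective + prior_a) / (
        history.dies_tested + prior_a + prior_b
    )


def plan_testing(histories, threshold=0.05):
    """Mark a region for exhaustive testing when its estimated rate is high."""
    plan = {}
    for history in histories:
        rate = posterior_defect_rate(history)
        plan[history.name] = "exhaustive" if rate >= threshold else "selective"
    return plan


# Illustrative counts only; not taken from any real fab.
histories = [
    RegionHistory("edge", dies_tested=400, dies_defective=36),
    RegionHistory("mid_ring", dies_tested=400, dies_defective=12),
    RegionHistory("center", dies_tested=400, dies_defective=4),
]
plan = plan_testing(histories)
for name, decision in plan.items():
    print(name, decision)
```

In this toy setup only the edge region exceeds the threshold, so the other regions would receive the cheaper selective test.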
Machinery Diagnostics
Another application is addressing machinery diagnostics for manufacturing. Typically, when a piece of machinery malfunctions, operators consult a predetermined flowchart that outlines a single procedure to identify the cause of the malfunction. Researchers are automating diagnostic troubleshooting by using Bayesian networks, built on a probabilistic fault model of the device, to compute the sequence of troubleshooting steps on the fly. The goal is to resolve equipment problems faster and more efficiently, minimizing downtime and reducing costs.
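One way to picture on-the-fly sequencing is the sketch below. It assumes a single-fault model with perfect inspections, updates fault probabilities from an observed symptom using Bayes' rule, and always inspects the component with the highest probability-to-cost ratio. The component names, probabilities, and costs are invented for illustration, and this simple model stands in for the full Bayesian network described above.

```python
def posterior_given_symptom(priors, likelihoods):
    """Bayes' rule: P(fault | symptom) is proportional to P(symptom | fault) * P(fault)."""
    unnormalized = {f: priors[f] * likelihoods[f] for f in priors}
    total = sum(unnormalized.values())
    return {f: v / total for f, v in unnormalized.items()}


def next_step(belief, costs):
    """Pick the component with the best probability-per-unit-cost ratio."""
    return max(belief, key=lambda f: belief[f] / costs[f])


def rule_out(belief, fault):
    """Remove a component found healthy and renormalize the remaining beliefs."""
    remaining = {f: p for f, p in belief.items() if f != fault}
    total = sum(remaining.values())
    if total == 0:
        return remaining
    return {f: p / total for f, p in remaining.items()}


def troubleshoot(belief, costs, true_fault):
    """Simulate the inspection sequence until the true fault is found."""
    steps = []
    while belief:
        candidate = next_step(belief, costs)
        steps.append(candidate)
        if candidate == true_fault:
            break
        belief = rule_out(belief, candidate)
    return steps


# Hypothetical fault model for a machine reporting low pressure.
priors = {"pump": 0.5, "valve": 0.3, "sensor": 0.2}
p_low_pressure_given_fault = {"pump": 0.2, "valve": 0.9, "sensor": 0.5}
inspection_minutes = {"pump": 30.0, "valve": 10.0, "sensor": 5.0}

belief = posterior_given_symptom(priors, p_low_pressure_given_fault)
steps = troubleshoot(belief, inspection_minutes, true_fault="sensor")
print(steps)
```

Unlike a fixed flowchart, the order here depends on the observed symptom: a different symptom would change the beliefs and therefore the first component inspected.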
Lithography Wafer Classification
A third internal project is applying Bayesian techniques to determine if the calibration test wafers produced by lithography machines are defective. In tackling the problem, researchers used two sets of test wafers: one defective, the other defect-free. From these, they generated a set of 100 quantitative measures for each wafer, such as resistance at various spatial locations. Then they applied a machine learning algorithm to find subsets of the 100 characteristics that distinguished the wafers as defective or defect-free. The algorithm was able to identify patterns in the data that accurately classified each wafer. This pattern of relationships would have been impossible to detect manually because of the high dimensionality of the set of distinguishing characteristics. Classification of the test wafers allows rapid assessment of the health of the lithography machine producing the wafers.
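The text does not name the algorithm used, so the following is only a sketch of the general approach: score each of 100 measurements by how well it separates the two wafer sets, keep the best few, and classify new wafers by the nearer class centroid. The synthetic data, the separation score, and the nearest-centroid classifier are all assumptions chosen for brevity.

```python
import random
import statistics


def make_wafers(n, defective, informative, rng, n_features=100):
    """Synthetic wafers: defective ones are shifted on the informative features."""
    wafers = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        if defective:
            for j in informative:
                x[j] += 3.0
        wafers.append(x)
    return wafers


def separation_score(good, bad, j):
    """Gap between class means on feature j, relative to the class spreads."""
    g = [w[j] for w in good]
    b = [w[j] for w in bad]
    spread = statistics.pstdev(g) + statistics.pstdev(b)
    return abs(statistics.mean(g) - statistics.mean(b)) / spread


def select_features(good, bad, k):
    """Return the indices of the k features with the highest separation score."""
    ranked = sorted(
        range(len(good[0])),
        key=lambda j: separation_score(good, bad, j),
        reverse=True,
    )
    return sorted(ranked[:k])


def centroid(wafers, features):
    return [statistics.mean(w[j] for w in wafers) for j in features]


def classify(wafer, features, good_centroid, bad_centroid):
    """Assign the wafer to the class whose centroid is nearer."""
    def dist2(c):
        return sum((wafer[j] - cj) ** 2 for j, cj in zip(features, c))
    if dist2(bad_centroid) < dist2(good_centroid):
        return "defective"
    return "defect-free"


rng = random.Random(0)
informative = [3, 42, 77]  # hypothetical measurements that carry the signal
good = make_wafers(50, False, informative, rng)
bad = make_wafers(50, True, informative, rng)

features = select_features(good, bad, k=3)
good_c, bad_c = centroid(good, features), centroid(bad, features)

# Evaluate on fresh wafers that were not used for selection.
test_set = [(w, "defect-free") for w in make_wafers(20, False, informative, rng)]
test_set += [(w, "defective") for w in make_wafers(20, True, informative, rng)]
correct = sum(classify(w, features, good_c, bad_c) == label for w, label in test_set)
accuracy = correct / len(test_set)
print(features, accuracy)
```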
Reversing the Curse of Dimensionality
The human eye can readily identify patterns or correlations in data for problems involving only two or three variables or "dimensions." For example, monthly data for rainfall and crop yields could be plotted on paper and the relationship between the two dimensions would be easy to see. However, as the number of dimensions increases beyond three, human intuition fails. If you were to plot data for a problem involving 100 dimensions, it would be impossible to visually identify clusters in the data, even though a pattern may exist. In a sense, the data points are far away from each other in a high-dimensional space. Mathematicians refer to this as the "curse of dimensionality."
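A small experiment can make this concrete. The sketch below, which uses uniformly random points as an assumed stand-in for real data, compares the nearest and farthest pairwise distances in a point cloud. As the number of dimensions grows, the ratio moves toward 1: every point ends up roughly equally far from every other, so nothing looks "close" enough to suggest a cluster.

```python
import itertools
import math
import random


def distance_contrast(n_points, n_dims, rng):
    """Ratio of the smallest to the largest pairwise distance in a random cloud."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return min(dists) / max(dists)


rng = random.Random(0)
for n_dims in (2, 3, 10, 100):
    # A ratio near 0 means some points are much closer than others;
    # a ratio near 1 means all points are about equally far apart.
    print(n_dims, round(distance_contrast(200, n_dims, rng), 3))
```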
Intel researchers, attempting to discriminate between two sets of silicon wafers (one set good, the other defective), faced just this challenge: too many interrelated features. They developed a statistical algorithm that was able to transcend the limits of human intuition and find the hidden patterns in the data. Specifically, the algorithm was able to organize the hundreds of dimensions into two distinct groups that accurately predicted whether or not a given wafer was defective.
Support for Other Machine Learning Research at Intel
In addition to these manufacturing applications, the Statistical Computing project team is lending its expertise to other machine learning projects underway at Intel. One project is exploring an alternative approach to network security that relies on distributed detection and inference rather than on a central network management node. A distributed system is more robust and is better able to outrun worms and viruses before they can cause major damage.
The team is also supporting the Human Activity Recognition project, whose goal is to build a system that can automatically infer a wide range of activities of daily living from sensor data, and provide proactive assistance for those with cognitive disabilities. The Statistical Computing team is working with Human Activity Recognition researchers to develop a mechanism to prompt the user when help is required. The challenge is to create algorithms that balance the cost of prompting too often (annoying the user, who would soon learn to ignore the prompts) with the cost of failing to prompt when assistance is crucial. To address the challenge, researchers are applying a technique known as a partially observable semi-Markov decision process (POSMDP), which is a method of inferring the subject's condition and weighing the tradeoffs over time between potential actions (in this case, prompting or not) when there is incomplete information.
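The tradeoff can be illustrated with a deliberately simplified, one-step version of the problem. The sketch below is not a POSMDP: it ignores action durations and future consequences, and simply updates the probability that the user needs help from a noisy sensor reading, then prompts only when the expected cost of staying silent exceeds the expected cost of an unnecessary prompt. The sensor probabilities and cost values are hypothetical.

```python
def update_belief(belief, observed_stall, p_stall_if_stuck=0.8, p_stall_if_ok=0.1):
    """Bayes update of P(user needs help) from one noisy sensor observation."""
    if observed_stall:
        like_stuck, like_ok = p_stall_if_stuck, p_stall_if_ok
    else:
        like_stuck, like_ok = 1 - p_stall_if_stuck, 1 - p_stall_if_ok
    numerator = like_stuck * belief
    return numerator / (numerator + like_ok * (1 - belief))


def should_prompt(belief, cost_missed=10.0, cost_annoy=1.0):
    """Prompt when the expected cost of silence exceeds that of a needless prompt."""
    expected_cost_of_silence = belief * cost_missed
    expected_cost_of_prompt = (1 - belief) * cost_annoy
    return expected_cost_of_silence > expected_cost_of_prompt


belief = 0.05  # assumed prior probability that the user needs help
for observed_stall in (False, True, True):
    belief = update_belief(belief, observed_stall)
    print(round(belief, 3), should_prompt(belief))
```

Raising cost_annoy makes the system more reluctant to prompt, while raising cost_missed makes it prompt on weaker evidence; a full POSMDP would additionally weigh how each choice affects later steps.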
Informing Future Intel Architecture
The work of the Statistical Computing team is being leveraged to understand the implications of machine learning for the design of future Intel microprocessors. Intel envisions that computationally challenging machine learning algorithms will be important in the future, as "intelligent" computing devices are embedded throughout the environment. To ensure that microprocessors of the future will be capable of running machine learning algorithms efficiently, Intel architects are simulating computational demands by studying workloads using actual data.
Enabling the Research Community
In addition to applying their research to real-world manufacturing problems, the Statistical Computing team is developing tools to enable the broader research community. The primary tool is the Probabilistic Networks Library (PNL), an open-source library of machine learning algorithms and models that any researcher can acquire to build new applications. The library is widely used by the machine learning research community and continues to expand, thanks to the ongoing contributions of 26 external developers and many more Intel engineers.
More recently, the research team has developed the Machine Learning Library (MLL). Whereas PNL is focused on probabilistic models, such as Bayesian networks, MLL is focused on statistical learning techniques that involve learning from data. These techniques are useful in applications such as mining data to find clusters of cancer genes. The Machine Learning Library is currently being used and developed internally at Intel but may be released externally in the future, following in the footsteps of Intel's Open Source Computer Vision Library (OpenCV), which now enjoys a wide following in the computer vision community.
Tao Wang; Qian Diao; Yimin Zhang; Gang Song; Chunrong Lai; Bradski, G, "A dynamic bayesian network approach to multi-cue based visual tracking", ICPR 2004
Denver Dash, (2005) Restructuring Dynamic Causal Systems in Equilibrium. In Robert Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AIStats 2005). To appear.
Greg Cooper, Weng-Keen Wong, Denver Dash, John Levander, Bill Hogan, and Michael Wagner. Bayesian biosurveillance using multiple data streams. In Proceedings of the Syndromic Surveillance Conference, 2004.
Denver Dash and Gregory F. Cooper. Model averaging for prediction with discrete Bayesian networks. Journal of Machine Learning Research, 5:1177 - 1203, September 2004.
John Mark Agosta; Thomas Gardos, "Bayes Network "Smart" Diagnostics", Intel Technology Journal, 2004
Georgios Theocharous, Kevin Murphy, Leslie Pack Kaelbling "Representing hierarchical POMDPs as DBNs for multi-scale robot localization", IEEE Conference on Robotics and Automation (ICRA), New Orleans, April 2004. (Also appeared at RUR'03 IJCAI workshop on Reasoning with Uncertainty in Robotics).
Sridhar Mahadevan, Mohammad Ghavamzadeh, Khashayar Rohanimanesh, Georgios Theocharous, "Hierarchical Approaches to Concurrency, Multiagency, and Partial Observability", Learning and Approximate Dynamic Programming: Scaling up to the Real World, Edited by Jennie Si, Andy Barto, Warren Powell, and Donald Wunsch, John Wiley & Sons, New York, 2004.
Statistical Computing Project Team
Gary Bradski
John Mark Agosta
Bob Davies
Denver Dash
Georgios Theocharous