SQL Server Analysis Services features nine different data mining algorithms that looks for specific types of patterns in trends in order to make predictions about your data. This is a potentially very powerful tool and since I’ve been learning more about data mining recently I figured I’d put together a little bit of information and research I’ve done on these algorithms for my own reference as well as for the benefit of others.
Below you will see a list of the Data Mining Algorithms included in SSAS 2008 to 2012. I’ve included the type of the algorithm, what it does, and an example or two of when one might decide its an appropriate algorithm for your data and requirements. I’ve also included links to TechNet for more information.
What: This is probably the most popular data mining algorithm simply because the results are very easy to understand. Decision Tree tries to predict the value of a column or columns based on the relationships between the columns you have identified. Decision Tree also determines which input columns are most highly correlated to your prediction column(s).
When: Use the Decision Tree algorithm when you want to try to predict if a customer will buy your product or what characteristics may make a person a potentially good customer. For example, if you want to determine which customers you should send coupons, use Decision Tree to determine if a person has the qualities and characteristics of past customers.
What: This algorithm is used to predict continuous variables using continuous input variables. Linear Regression is a simpler version of Decision Tree without the splits in the tree.
When: Use the Linear Regression algorithm when you want to compute a trend for your sales data.
What: The Naïve Bayes algorithm is based on Bayes’ theorems and good for performing predictive analytics. This algorithm calculates a probability based on input columns you define.
When: An example of when you might use the Naïve Bayes algorithm would be when you want to predict how likely a customer is likely to respond to an email blast of how likely a patient is to get sick or contract a disease based on their demographic information.
What: The Neural Network algorithm may be one of the least used data mining algorithms because it is the most difficult to interpret. Basically Neural Network combines each possible input attribute value with each predictable attribute value to determine probabilities. These probabilities can be used for classification or regression, but you may have a difficult time determining how the algorithm reached a particular conclusion.
When: Use this when you want to analyze the relationships between many complex inputs to determine a certain output, such as predicting stock movements or currency fluctuations.
What: The Logistic Regression algorithm is a statistical method for determining the contribution of specified inputs to a particular set of outcomes. This algorithm is similar to Neural Network in the way that it models the relationships between the various inputs.
When: An example of when you might use this algorithm would be to predict what characteristics make a customer a repeat customer or if a convicted criminal is likely to become a repeat offender.
What: The Clustering algorithm is probably very close to Decision Tree as far as data mining algorithms that are used most frequently simply because, like Decision Tree, it is also very easy to understand. The Clustering algorithm groups cases in a data set using the input columns into groups or clusters of cases with similar characteristics.
When: This algorithm is great for detecting fraud or anomalies in your data because it is very easy to see which data does not fit into a cluster.
Type: Sequence Analysis
What: Sequence Clustering is similar to clustering except instead of looking for clusters based on the similarity of characteristics, the clusters are based on a model. The algorithm groups sequences of events that are identical.
When: An example of when you might use this algorithm would be when you want to determine which sequence of events are likely to lead to hardware failure in your environment.
What: The Association Rules algorithm defines combinations of items in a set by scanning the data in the specified inputs and identifying frequent combinations of items.
When: Use this algorithm when you want to identify opportunities for cross selling. For example, the Association Rules may determine that if a customer purchases milk there may be an opportunity to market cookies to the customer.
What: The Time Series algorithm is useful in predicting or forecasting continuous variables over time, such as sales metrics. Time Series does not require additional data to make predictions about the future. Time Series makes predictions 3 or 5 units into the future.
When: Use the Time Series algorithm when you want to predict sales for the next 3 or 5 months.