Detecting Ghosts with Boo-leans: Part 2
Unleash the Power of Logistic Regression with Alteryx
Now that a week has passed, you should know everything you need to know about ghosts. Your preparation might have come from watching the Ghostbusters movies or from getting your scary story fix by reading Creepypastas on Reddit. This week, we will put the data to the test with our first logistic model.
To begin, make sure that you have downloaded and installed the Alteryx R Installer. Once it is installed, the Predictive District should be populated with an arsenal of tools. Today we will concentrate on the Logistic Regression tool, but we will visit the Stepwise and Gradient Boosting tools later.
Last week we covered techniques for reviewing the data with the Descriptive District. That assessment created an opportunity for feature engineering (creating new variables), which gives us extra firepower for our models. After those new formulas are applied with the Formula Tool, we need to bring in a Select Tool. This is crucial for two reasons: 1) renaming columns to help with interpretation and 2) making sure that the data types are correct. A logistic regression target has only two possible values, 1 or 0, so we treat it as a category and change that column to a string data type. Once that is updated, we are ready to build our first machine learning algorithm.
When we connect the Logistic Regression tool to our workflow, we see a bevy of new options. The key fields to fill out are the target and predictor variable fields. The target field, also known as the dependent variable, is the column of data that we are trying to predict. In this scenario, Ghost will be the target variable. Predictor variables, also known as independent variables, are the columns of data that we use to explain the target field. Here, the fields describing the characteristics that indicate whether or not something is a ghost will be the predictor variables. For those with previous machine learning experience, Alteryx also provides customizable, code-free options, such as creating cross-validation folds and switching between a logit and a probit model. Add a Browse tool to each of the Logistic Regression tool's three anchors (trick: if you right-click on the tool, there is an option to add a Browse to all anchors) and you are ready to run!
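For anyone wondering what the logit/probit switch actually changes, here is a minimal Python sketch. It is not Alteryx's or R's internal code, and the function names and sample scores are made up for illustration. Both options take the same linear score (intercept plus weighted predictors) and squeeze it into a probability between 0 and 1; they just use different curves to do it.

```python
import math


def logit_probability(score):
    """Logit model: convert a linear score to a probability with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-score))


def probit_probability(score):
    """Probit model: convert a linear score with the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))


# Hypothetical linear scores, chosen only to compare the two curves.
for score in (-2.0, 0.0, 2.0):
    print(score, round(logit_probability(score), 3), round(probit_probability(score), 3))
```

Both curves agree at a score of 0 (a 50/50 call), and the probit curve approaches 0 and 1 faster as the score moves away from 0. In practice the two models usually rank records very similarly.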
The rest of this blog covers what these three anchors output and how we will use them moving forward. On top is the anchor marked with the letter O, for Output. This anchor does not do much for us at first: when we click on it, we see a row of data and not much else. In part three, however, this anchor will be crucial for deploying the model. Things get interesting, and the gibberish emerges, at the R anchor, which stands for Reports. This is where the tool outputs the model results generated by R.
We are inundated with summary statistics when we look at the report section. Here are a few points of focus when trying to quantify and understand the model. Let's start with record 6, which gives us the coefficients for the variables. These tell us the relationship between each independent variable and the dependent variable. There should be similarities between what we are seeing now and what we saw in the last blog when we performed the correlation analysis. Each of the variables, except for rotting flesh, has a negative sign in front of the number. This means that those variables have an inverse relationship with the image being a ghost, e.g. the smaller the bone length, the higher the likelihood that we are seeing a ghost. These coefficients would let us build the model by hand if we wanted to manually calculate the prediction for each image. Luckily for us, as we will see next week, the Score tool will do that for us.
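To make the "manually calculate" idea concrete, here is a minimal Python sketch of how a logit model turns coefficients into a probability. The intercept, coefficient values, and column names below are hypothetical placeholders, not the numbers from this model's report; only their signs follow the pattern described above (negative for bone length, positive for rotting flesh).

```python
import math

# Hypothetical values for illustration only; read the real ones
# from the R (Reports) anchor of your own model.
INTERCEPT = 1.5
COEFFICIENTS = {
    "bone_length": -4.0,    # negative: longer bones lower the ghost score
    "rotting_flesh": 2.0,   # positive: more rotting flesh raises it
}


def ghost_probability(record):
    """Return the predicted probability of a ghost for one record (logit model)."""
    score = INTERCEPT + sum(
        coefficient * record[name] for name, coefficient in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


short_bones = {"bone_length": 0.2, "rotting_flesh": 0.5}
long_bones = {"bone_length": 0.8, "rotting_flesh": 0.5}

print(round(ghost_probability(short_bones), 3))
print(round(ghost_probability(long_bones), 3))
```

With everything else held equal, the record with the shorter bone length gets the higher ghost probability, which is exactly what a negative coefficient means.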
Next, we want to look at the p-values. We are not going to review the high school statistics behind how p-values are calculated, but we do need to understand that they are crucial to determining the significance of a variable. The rule of thumb is that if a variable's p-value is less than 0.05, we deem it significant and carry it forward. In our case, every variable moves forward, as they are all under that threshold. Now we need to see how accurate the model is.
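For the curious, here is a rough Python sketch of where such a p-value can come from, assuming the report uses the usual Wald z-test (coefficient divided by its standard error, compared against a standard normal distribution), which is what R's glm summary prints. The variable names and numbers are hypothetical, not output from this model.

```python
import math


def wald_p_value(coefficient, std_error):
    """Two-sided p-value for the null hypothesis that the coefficient is zero."""
    z = coefficient / std_error
    # Standard normal CDF at |z|, computed with the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


# Hypothetical (coefficient, standard error) pairs for illustration.
estimates = {"bone_length": (-4.0, 1.0), "noise_column": (0.3, 0.5)}
ALPHA = 0.05  # the rule-of-thumb significance threshold

for name, (coefficient, std_error) in estimates.items():
    p_value = wald_p_value(coefficient, std_error)
    verdict = "carry forward" if p_value < ALPHA else "reconsider"
    print(f"{name}: p = {p_value:.4f} -> {verdict}")
```

A coefficient that is large relative to its standard error yields a small p-value and passes the 0.05 rule of thumb; one that is small relative to its standard error does not.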
This is where we will leverage the last anchor, I for Interactive Reporting, to evaluate model performance. Here we meet a friendly face from our past, the Confusion Matrix. The Confusion Matrix is a quick way to evaluate how the model's predictions compare against reality. Alteryx helpfully calculates the accuracy, precision, and optimal cutoff for us, but we still need to use the matrix to compare what the model is getting right and wrong.
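As a rough illustration of how a cutoff turns predicted probabilities into the four quadrants of a confusion matrix, here is a small Python sketch. The function name, labels, and sample data are invented for the example; this is not how Alteryx computes its report, only the general idea.

```python
def confusion_matrix(actuals, probabilities, cutoff=0.5):
    """Count the four quadrants for 0/1 actuals and predicted probabilities."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, probability in zip(actuals, probabilities):
        predicted = 1 if probability >= cutoff else 0
        if predicted == 1:
            # Predicted ghost: correct if it really was one.
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            # Predicted not a ghost: a miss if it really was one.
            counts["FN" if actual == 1 else "TN"] += 1
    return counts


# Hypothetical data: 1 = ghost, 0 = not a ghost.
actuals = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

print(confusion_matrix(actuals, probabilities, cutoff=0.5))
print(confusion_matrix(actuals, probabilities, cutoff=0.65))
```

Raising the cutoff from 0.5 to 0.65 in this toy data removes the one false positive without losing any true positives, which is the kind of trade-off an optimal cutoff is meant to balance.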
This is one of the bigger shifts between linear and logistic regression. With linear regression we tend to focus only on R-squared values, whereas with logistic regression we have more to evaluate. Confusion Matrices let us see quickly where our predictions are right or wrong and where our focus needs to go. Our initial focus should be on the Predicted Positive/Actual Positive and Predicted Negative/Actual Negative quadrants. These areas show us how often we are correct when detecting ghosts. In this case, we see a lot of green, which shows that we are right more often than we are wrong. For example, when the model predicts that it is seeing a ghost, it is right 80% of the time and wrong 20% of the time. For our purposes, this looks to be a tolerable level of accuracy, but that needs to be judged scenario by scenario. For instance, if we were predicting car accidents and you were an insurance broker, you might want a higher level of accuracy (or a higher threshold) before deploying, since there is a large monetary cost attached to being incorrect.
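To show how figures like "right 80% of the time" fall out of the quadrants, here is a small Python sketch. The counts are hypothetical and were chosen only so that the precision matches the 80% mentioned above; they are not the actual counts from this model.

```python
def summarize(tp, fp, fn, tn):
    """Compute common metrics from the four confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # When the model says "ghost", how often is it right?
        "precision": tp / (tp + fp),
        # Of the real ghosts, how many did the model catch?
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for illustration only.
metrics = summarize(tp=80, fp=20, fn=15, tn=85)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Precision is the number behind the "right 80% of the time" statement, while accuracy also counts the correct "not a ghost" calls. Which metric matters most depends on the cost of each kind of mistake in your scenario.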
After two weeks of work, our ghost detector has exceeded our thresholds and we are ready to share our results with the paranormal community. Before we can share our findings, we will want to take a few additional steps to optimize the model and then prep it for real-time scoring. Next week, we will review how to perform those steps and wrap up our three-part series on Logistic Regression in Alteryx.
Author: Justin Grosz