Detecting Ghosts with Boo-leans: Part 1
Unleash the Power of Logistic Regression with Alteryx
The journey into machine learning often begins with linear regression. We are exposed to the concept in our high school statistics classes or in the many reruns of the movie Moneyball on FX. When we learn linear regression, the focus is on finding the R² value and judging the quality of the model from that one number. If the number passes our threshold, we look like all-stars to our team.
Unfortunately, this is one of the biggest mistakes that an emerging data scientist can make. Over this three-part series, we will correct these bad habits by learning about logistic regression and how this technique forces the user to look past one number to build the optimal model.
At the simplest level, logistic and linear regression are similar; the key difference is that a linear model's output can be any real number, while a logistic model outputs a probability between 0 and 1 that is used to predict a binary (Boolean) outcome. Instead of trying to figure out how many hits Mookie Betts is going to have tonight using linear regression, we can model whether he is going to get a hit using logistic regression. Traditionally, performing this analysis requires knowing a programming language, but within Alteryx’s Predictive District we can do it code free.
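To make the difference concrete, here is a minimal Python sketch (outside Alteryx, purely for illustration) of how a logistic model turns an unbounded linear score into a probability and then into a yes/no answer. The scores and the 0.5 threshold are assumptions chosen for the example, not values fitted to any baseball data.

```python
import math


def sigmoid(score: float) -> float:
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_hit(score: float, threshold: float = 0.5) -> bool:
    """Turn the probability into a Boolean prediction (threshold is an assumption)."""
    return sigmoid(score) >= threshold


# Hypothetical linear scores (intercept + weighted inputs), not fitted to real data.
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_hit(score))
```

A linear model would report the raw score itself; the logistic model squeezes it into a probability first, which is why its final answer can be read as true or false.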
To ignite the Halloween spirit in all of us, we will use a former Kaggle competition data set, Ghouls, Goblins, and Ghosts...Boo, to learn logistic regression. In this data set, there are multiple columns that describe characteristics of these Halloween figures, such as how much rotting flesh they have or what percentage of their soul is left. Our goal is to use this data to tell us whether or not there is a ghost in front of us.
To begin the modeling process, we need to clean up the data and perform descriptive analysis. This is where we can start to leverage Alteryx and use it to make sure there are no errors that could derail our results (the Select and Data Cleansing tools will do the trick). Next, we need to establish our training and test data sets. In this scenario I am using two data sets, but if you are building a model with just one data set, you will need to split it up. This is done so that we can build a model with the training set and then test its accuracy against data the model has never seen before. An 80%/20% (training/test) split should work well in this scenario and can be done using the Sample tool. From here, we can perform statistical analysis and feature engineering to get ready for the modeling stage.
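For readers who want to see what an 80%/20% split amounts to, here is a small Python sketch. It is a generic random split, not a description of how the Sample tool works internally, and the function name, seed, and placeholder records are all made up for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for rows of the data set.
rows = [{"id": i} for i in range(100)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters: if the file happens to be sorted, taking the first 80% as-is could leave the test set with a very different mix of records.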
This data set was originally built to detect ghouls, goblins, and ghosts, but in this example we are aiming to detect only the ghosts. To achieve this, we need to create a new column with an IF statement that returns 1 if the creature is a ghost and 0 if it is not. For now, we can leave this as a numerical field. Next, we will go into the most important Alteryx district for descriptive analytics, the Data Investigation district.
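The same IF logic can be sketched in Python. The field name "type" and the label "Ghost" are assumptions about how the creature's class is stored; adjust them to match your own file.

```python
def add_ghost_flag(rows, type_field="type", ghost_label="Ghost"):
    """Return copies of the rows with a numeric Ghost column (1 = ghost, 0 = not)."""
    return [
        {**row, "Ghost": 1 if row[type_field] == ghost_label else 0}
        for row in rows
    ]


# Made-up sample rows for illustration only.
sample = [
    {"id": 0, "type": "Ghoul"},
    {"id": 1, "type": "Ghost"},
    {"id": 2, "type": "Goblin"},
]
flagged = add_ghost_flag(sample)
print([row["Ghost"] for row in flagged])
```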
When going into this district, the first tool we must pull onto our canvas is the Field Summary tool. With this tool we can expand on the insights that a Browse tool normally gives. We can now see the min, max, median, average, mode, and standard deviation for all the numerical fields in the data set. From the tool's output anchors, we also get a scatter plot of each field. This is a good way to spot potential outliers and to use the average and standard deviation to create rules for handling them.
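As a rough illustration of these summary statistics and of a standard-deviation rule for outliers, consider the Python sketch below. The values are invented, and the cutoff of two standard deviations is just one possible rule, not a recommendation from Alteryx.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics for one numerical field."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


def flag_outliers(values, k=2.0):
    """Return values more than k standard deviations away from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]


# Invented measurements with one suspicious value at the end.
values = [0.2, 0.3, 0.25, 0.35, 0.3, 0.28, 0.32, 0.27, 0.31, 0.29, 0.95]
summary = summarize(values)
print(summary)
print(flag_outliers(values))
```

Keep in mind that an extreme value inflates the standard deviation it is measured against, so with small samples a stricter cutoff may never trigger.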
An often overlooked comparison with this tool is to bring in a Filter tool, filter for Ghost=1, attach another Field Summary tool to each output, and then join the results together to compare the records where Ghost=1 against those where Ghost=0. This side-by-side comparison shows whether the ghosts differ significantly from the rest of the population. If they do, further feature engineering can give the model signals (variables) that help it determine whether the creature is a ghost or not.
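The side-by-side comparison can be sketched in Python as follows. The field name "hair_length" and the numbers are illustrative assumptions, and only the mean is compared here to keep the example short.

```python
import statistics


def compare_groups(rows, field, flag="Ghost"):
    """Compare the mean of `field` for flag == 1 against flag == 0."""
    ghosts = [row[field] for row in rows if row[flag] == 1]
    others = [row[field] for row in rows if row[flag] == 0]
    mean_ghost = statistics.mean(ghosts)
    mean_other = statistics.mean(others)
    return {
        "ghost": mean_ghost,
        "not_ghost": mean_other,
        "difference": mean_ghost - mean_other,
    }


# Invented rows; "hair_length" is a hypothetical numerical field.
rows = [
    {"Ghost": 1, "hair_length": 0.25},
    {"Ghost": 1, "hair_length": 0.5},
    {"Ghost": 0, "hair_length": 0.5},
    {"Ghost": 0, "hair_length": 0.75},
    {"Ghost": 0, "hair_length": 1.0},
]
result = compare_groups(rows, "hair_length")
print(result)
```

A large gap between the two groups suggests the field carries a signal worth passing to the model.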
An example of feature engineering is the color field. Since it is not a numerical field, the Field Summary tool offers few insights about it. If we think about a ghost’s appearance, it will likely be either white or clear, and at a glance those values come up quite often in the data. To test this hypothesis, we can bring in the Frequency Table tool to see how often those values occur when Ghost=1. The tool confirms the theory: over 64% of the ghosts are either white or clear. Insights like these are what lead us to create new variables, which is the heart of feature engineering. To create this variable, we can bring in a Formula tool and write another IF statement: if color = “clear” or color = “white” then 1, else 0.
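Here is a hedged Python sketch of both steps: a frequency table of colors among the ghosts, followed by the new 1/0 variable. The sample rows are invented, and the column name "WhiteOrClear" is a hypothetical name for the new field.

```python
from collections import Counter


def color_shares(rows, flag="Ghost"):
    """Share of each color among rows where flag == 1."""
    colors = [row["color"] for row in rows if row[flag] == 1]
    counts = Counter(colors)
    return {color: n / len(colors) for color, n in counts.items()}


def add_pale_flag(rows):
    """Add WhiteOrClear = 1 when color is white or clear, else 0."""
    return [
        {**row, "WhiteOrClear": 1 if row["color"] in ("white", "clear") else 0}
        for row in rows
    ]


# Invented rows for illustration only.
rows = [
    {"Ghost": 1, "color": "white"},
    {"Ghost": 1, "color": "clear"},
    {"Ghost": 1, "color": "white"},
    {"Ghost": 1, "color": "green"},
    {"Ghost": 0, "color": "black"},
]
shares = color_shares(rows)
flagged = add_pale_flag(rows)
print(shares)
print([row["WhiteOrClear"] for row in flagged])
```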
The last step before we start modeling is to view the correlation between variables. Correlations can be calculated in a variety of ways, but in this example we will use the Pearson Correlation tool. Here we select all numerical fields (keeping Ghost among them) and run the workflow to see the relationships between the variables. Our focus is on how the independent variables relate to our dependent variable, Ghost. We see that most of the variables have a negative relationship with Ghost (we will go into interpretation in Part 2). This is something to keep in mind when we build our model, so we can check that the model’s results are consistent with these relationships.
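For reference, the Pearson correlation coefficient can be computed by hand as in the Python sketch below. The two lists are invented values, with "has_soul" as a hypothetical stand-in for the soul-percentage column, so the result says nothing about the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented values: the Ghost flag and a hypothetical numerical field.
ghost = [1, 1, 0, 0, 0]
has_soul = [0.1, 0.2, 0.6, 0.7, 0.9]
r = pearson(ghost, has_soul)
print(round(r, 3))
```

A value near -1 means the field tends to be low when Ghost is 1, a value near +1 means it tends to be high, and a value near 0 means there is little linear relationship.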
Now we are equipped with initial insights and can run our first code-free model in Alteryx! In Part 2 (coming out next Friday), we will run the model and interpret the statistical gibberish that comes out of the model’s summary.