15 January 2019
With deep learning taking the data science community by storm, people tend to look down on relatively simple machine learning algorithms. Deep learning can tackle problems that require complex input-output mappings, but deep models lack adequate interpretability. Enter logistic regression, which provides interpretable reasoning for its predictions. The purpose of this post is to walk through an example application of logistic regression and show how to interpret the model's outcome.
Before logistic regression can be considered a valid algorithm for the data, check these seven assumptions to confirm logistic regression is the best algorithm for the job:
Logistic regression requires the dependent variable to be binary.
For binary logistic regression, 1 should represent the desired outcome and 0 the undesirable outcome.
There should be a linear relationship between each continuous independent variable and the logit of the outcome.
No multicollinearity. The independent variables cannot be highly correlated with one another, as correlated predictors make it difficult to attribute the variance explained in the dependent variable.
No significant outliers.
Only meaningful variables should be included.
Logistic regression requires large sample sizes.
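As a quick illustration of the multicollinearity check, a correlation matrix flags predictor pairs that move together. This is a sketch on made-up values, not the Titanic data, and the 0.8 cutoff is a common rule of thumb rather than a fixed standard:

```python
import pandas as pd

# Made-up predictor values for illustration only
df = pd.DataFrame({
    'age':   [22, 38, 26, 35, 28, 54, 41, 27],
    'fare':  [7.3, 71.3, 7.9, 53.1, 8.1, 51.9, 30.0, 11.1],
    'sibsp': [1, 1, 0, 1, 0, 0, 0, 0],
})
corr = df.corr().abs()
# Flag any pair of distinct predictors with |r| above 0.8
risky = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.8]
print(risky)
```

Any pair that shows up in the flagged list is a candidate for dropping or combining before fitting the model.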
As mentioned above, logistic regression is used to predict dichotomous (binary) variables. Here are a few classification examples: spam vs. legitimate email, fraudulent vs. valid transactions, and sick vs. healthy patients.
Predictions can be derived from simple (one independent variable) or multivariate (two or more independent variables) models. The following is the data modeling process for the Titanic dataset, which contains metadata on over 800 Titanic passengers. The goal is to predict passenger survival based on this information.
The first step of building any machine learning model is investigating the dataset to understand the characteristics of each feature to determine if it contains telling predictive information.
Import the necessary Python libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
%matplotlib inline
Import the CSV file with pandas and peek at the first eight rows of data:
train = pd.read_csv('titanic_train.csv')
train.head(8)
Next, use the pandas DataFrame methods .describe() and .info() to obtain each feature's distribution and data type.
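Those two calls look like the following, sketched here on a tiny stand-in frame rather than the full CSV:

```python
import pandas as pd

# Tiny stand-in for the Titanic frame, illustrative values only
train = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 2],
    'Age':      [22.0, 38.0, 26.0, None],
    'Fare':     [7.25, 71.28, 7.93, 13.00],
})
print(train.describe())  # count, mean, std, min, quartiles, max for numeric columns
train.info()             # dtypes plus non-null counts, so gaps stand out
```

Note how the Age count is lower than the others, which is how missing values first surface.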
From the results above, we know the dataset contains 12 features (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked).
A first preliminary observation:
Passenger ID: unique identifier for each passenger; it contains no useful information for the ML model.
Drop PassengerId, Name, Cabin, and Ticket, as these features will not contribute to our model.
# default is axis = 0 for rows index; axis = 1 to drop columns
train_id = train['PassengerId']
train = train.drop(['PassengerId','Name', 'Cabin', 'Ticket'], axis = 1)
Next, check for missing data values in each column:
train.apply(lambda x: x.isnull().any())
Calculate the percentage of records that are null for features that have null values:
pd.DataFrame({'Percent Missing': train.isnull().sum() * 100 / len(train)})
The Cabin feature, with roughly 77% of its rows null, was already dropped above. Now let's resolve the missing data in the Age and Embarked features.
Calculate the overall average age of all Titanic passengers:
avg_age = train['Age'].mean()
print(avg_age)
# 29.69911764705882
We could use this average to impute all missing values; however, we can logically surmise that age is not evenly distributed across the three Titanic social classes. Let's confirm this thinking:
sns.boxplot(x = 'Pclass', y = 'Age', data = train, hue = 'Survived')
The boxplot above shows that age is not evenly distributed across the three social classes. We can investigate further by calculating the average age by Pclass grouping:
print(train.groupby(['Pclass']).get_group(1).Age.mean())
# Upper class: 38.233440860215055
print(train.groupby(['Pclass']).get_group(2).Age.mean())
# Middle class: 29.87763005780347
print(train.groupby(['Pclass']).get_group(3).Age.mean())
# Lower class: 25.14061971830986
Since the age distribution is uneven, it would not be appropriate to impute the overall average of 29.70 into each null value without considering the Pclass value:
# group by Pclass only; grouping by the Survived target would leak label information
train['Age'] = train.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))
Check again for percentage of missing values:
pd.DataFrame({'Percent Missing': train.isnull().sum() * 100 / len(train)})
The percentage of records that are null is very small, only 0.22%. Let’s find the characteristics of these records to see if there are traits in common. C = Cherbourg, Q = Queenstown, S = Southampton.
train[train.Embarked.isnull()]
Two passengers share very similar characteristics and circumstances: the same Pclass, Sex, and Fare.
train.groupby(['Pclass', 'Sex']).get_group((1,'female')).Embarked.value_counts()
The most likely departure location is Southampton, so this value is imputed for the two null records:
train.Embarked.fillna('S', inplace = True)
Confirm no remaining features have missing data:
For investigative purposes, check to determine the difference in fare price by social class and survival category:
train.groupby(['Pclass', 'Survived'])['Fare'].mean()
Check for percent uniqueness of each feature:
# unique().size counts the distinct values in each column; .size counts all records in each column
pd.DataFrame({'percent_unique': train.apply(lambda x: x.unique().size / x.size*100)})
Convert the categorical features (Sex and Embarked) into dummy/indicator variables so the model can consume them:
train = pd.get_dummies(train)
train.head(10)
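To see what pd.get_dummies does, here is the transformation on a two-column toy frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'S']})
encoded = pd.get_dummies(toy)
# Each category becomes its own 0/1 indicator column
print(sorted(encoded.columns))
# ['Embarked_C', 'Embarked_S', 'Sex_female', 'Sex_male']
```

Each categorical column is replaced by one indicator column per category, with a 1 marking the category that row belongs to.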
Store the Survived target in its own series and drop it from the main dataset to prepare for the train_test_split:
train_label = train['Survived']
train = train.drop(['Survived'], axis = 1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(train, train_label, test_size = 0.25, random_state = 3)
The data must be scaled to compensate for the unequal distributions and intervals of each feature. Fit the scaler on the training set only, then apply the same fitted scaler to the test set so both are on the same footing:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
train_scaled = pd.DataFrame(scaler.transform(X_train), index = X_train.index, columns = X_train.columns)
test_scaled = pd.DataFrame(scaler.transform(X_test), index = X_test.index, columns = X_test.columns)
train_scaled.head()
Now we can train the logistic regression model and score the scaled test set:
logreg = LogisticRegression()
logreg.fit(train_scaled, Y_train)
predictions = logreg.predict_proba(test_scaled)[:,1]
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(Y_test, predictions)
print(auc)
With an ROC AUC of roughly 0.69, this logistic regression model ranks a randomly chosen survivor above a randomly chosen non-survivor about 69% of the time; note that ROC AUC measures ranking quality, not plain accuracy.
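Logistic regression's interpretability, mentioned at the start of this post, comes from its coefficients: each coefficient is the change in log-odds per unit of a (scaled) feature, and exponentiating it gives an odds ratio. A sketch on synthetic data, since the feature names and values below are made up and not the fitted Titanic model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one strongly informative feature, one noise feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
odds_ratios = np.exp(model.coef_[0])
for name, oratio in zip(['informative', 'noise'], odds_ratios):
    # Odds ratio > 1 raises the odds of the positive class; near 1 means little effect
    print(f'{name}: odds ratio {oratio:.2f}')
```

On the Titanic model, the same np.exp(logreg.coef_[0]) trick paired with the column names would show which features push survival odds up or down.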