## A Bridges Machine Learning Tutorial:

## Predicting Divorce Using a Logistic Regression Model

### Goals and prerequisites

This hour-long tutorial provides an introduction to using Bridges OnDemand notebooks for machine learning. Since machine learning often requires the use of large datasets, part of this tutorial will involve uploading data to Bridges pylon5 file space (also called $SCRATCH). Bridges home directories have very limited space so your $SCRATCH directory should be used for storing data.

To complete this tutorial you must have an active Bridges account. You also need to know your PSC username and account. If you don't have a Bridges account, see how to apply here.

If you don't know your account string on Bridges, log into Bridges. See the Connecting to Bridges section of the Bridges User Guide for help on logging in.

Once you are connected, type

    id -Gn

Your account will be displayed. It's possible to have multiple accounts.

### Use sftp to add data to your Bridges pylon5 directory

- Download the divorce.rar file to your local computer from https://archive.ics.uci.edu/ml/machine-learning-databases/00497/ and decompress it.
- In the divorce directory find 'divorce.csv'. You'll need this file for the tutorial.
- If you don't know your account see 'Goals and prerequisites' above.
- To add 'divorce.csv' to your pylon5 file space via sftp, open your terminal and type the following commands, replacing *account* and *username* with your information. Enter your password when prompted after the first command.

      sftp username@data.bridges.psc.edu
      cd /pylon5/account/username
      mkdir divorce
      cd divorce
      put /local/path/to/file/divorce.csv

### Start a Jupyter notebook through the OnDemand interface on Bridges

- Go to the following link: https://ondemand.bridges.psc.edu and log in with your Bridges username and password.
- Select "Jupyter Notebook Production" from the "Interactive Apps" drop-down.
- Fill in the form as follows:
  - Set 'Number of Nodes' to '1'.
  - Set 'Number of Hours' to '1'.
  - Type in your account. If you don't know your account see 'Goals and prerequisites' above.
  - Choose RM-shared or RM-small for Partition. You can learn about other Bridges partitions here.
  - Extra Args can be left blank.

- Click "Launch".
- When the "Connect to Jupyter" link shows up, click on it.
- Start a new Jupyter Notebook by clicking on 'New' and choosing 'Python 3' from the dropdown.
- Change the title of the new notebook by clicking on "Untitled" on the new notebook page and typing "Divorce".

### Importing modules and loading the data

Now you will be typing Python 3 code into Code cells in the Jupyter Notebook. First import the required modules.

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    import os

Click "Run" to run your cell. In a new cell, type the following code to save the path to the divorce data file in your $SCRATCH (pylon5) file space.

    env_var = os.environ
    # Environment variable names are looked up without the '$' prefix.
    scratch_dir = env_var.get('SCRATCH')
    data = scratch_dir + '/divorce/divorce.csv'
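
If 'SCRATCH' is not set in the notebook's environment, 'get' returns 'None' and the string concatenation fails with an unhelpful error. Note also that 'os.environ' keys carry no '$' prefix. The following is a minimal sketch of a more defensive version; the helper name 'build_data_path' and the demo path are hypothetical and introduced only for illustration.

```python
import os

def build_data_path(environ=os.environ):
    """Return the path to divorce.csv under SCRATCH (hypothetical helper)."""
    scratch_dir = environ.get('SCRATCH')
    if scratch_dir is None:
        raise RuntimeError('SCRATCH is not set; are you running on Bridges?')
    return os.path.join(scratch_dir, 'divorce', 'divorce.csv')

# Demo with a made-up environment so the sketch runs anywhere.
demo_env = {'SCRATCH': '/pylon5/account/username'}
print(build_data_path(demo_env))
```

In the notebook itself you would call 'build_data_path()' with no argument so that the real environment is used.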

Click "Run" to run your cell. In a new cell, type the following code to load 'divorce.csv' and save it as a Pandas DataFrame named 'divorced'.

    # 'divorce.csv' uses semicolons instead of commas to separate cells, so sep=';'
    divorced = pd.read_csv(data, index_col=None, sep=';')
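
To see why the separator matters, the sketch below parses two made-up rows with Python's standard 'csv' module (not pandas), first with the default comma and then with a semicolon. The sample text is invented for illustration and only mimics the layout of the real file.

```python
import csv
import io

# Two made-up rows in a semicolon-delimited layout.
sample = "Atr1;Atr2;Class\n2;0;1\n"

rows_comma = list(csv.reader(io.StringIO(sample)))                  # default delimiter ','
rows_semicolon = list(csv.reader(io.StringIO(sample), delimiter=';'))

print(len(rows_comma[0]))      # fields found in the header with ','
print(len(rows_semicolon[0]))  # fields found in the header with ';'
```

With the wrong delimiter each row collapses into a single field. In pandas the same mistake typically shows up as a DataFrame with one wide column.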

### Exploring the data

Run the following code to display the first five rows of your dataframe with the column names along the top.

    divorced.head()

As you can see, a Pandas DataFrame is like an Excel spreadsheet. Each row is a different sample, in this case a marriage. Each column is a different attribute of the marriages (as defined by the answers to survey questions). You can find the Attribute Information (survey questions) here. In machine learning the attributes of the marriages are called independent variables or features. The variable you are trying to predict is called the dependent variable or target variable. In this case the rightmost column, 'Class', contains the target variable. For this dataset '1' means divorced and '0' means not divorced.

Running the following code will show you the number of records for each class in the "Class" column.

    divorced['Class'].value_counts()

    0    86
    1    84
    Name: Class, dtype: int64

The numbers of records for 'divorced' and 'not divorced' are about the same, so you can use accuracy to evaluate your model. If you'd like to learn about evaluating models where the number of records in each class is not balanced, this is a good overview.
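
One way to make the balance argument concrete is to compute the accuracy of a trivial model that always predicts the larger class. The sketch below rebuilds the labels from the counts printed above; it is an illustration, not part of the original notebook.

```python
from collections import Counter

# Class labels rebuilt from the printed counts: 86 zeros and 84 ones.
labels = [0] * 86 + [1] * 84

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 3))
```

Because this baseline is close to 50%, any accuracy well above it reflects real predictive power. With heavily unbalanced classes the baseline alone could look impressive.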

Strong correlations between variables make the coefficients in a logistic regression model unstable and this may cause overfitting. Overfitting is when the model fits so well to the training data that it doesn't work for new data.

The following code will look at correlations between variables by calculating Pearson Correlation Coefficients.

    # Create a correlation matrix
    corrmatrix = divorced.corr()
    corrmatrix.head()
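
For reference, each entry of the correlation matrix is a Pearson coefficient between two columns. The following plain-Python sketch computes that coefficient by hand for made-up answer lists; the function name 'pearson' and the data are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up answers to three survey questions.
q_a = [0, 1, 2, 3, 4]
q_b = [0, 1, 2, 3, 4]   # moves with q_a
q_c = [4, 3, 2, 1, 0]   # moves against q_a

print(pearson(q_a, q_b))
print(pearson(q_a, q_c))
```

Values near 1 or -1 indicate columns that carry almost the same information, which is the situation the tutorial is trying to avoid.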

Plotting correlation coefficients as a heatmap will make it easier to find correlated variables.

    # Pearson correlation matrix plotted as a heatmap
    dims = (22, 12)
    fig, ax = plt.subplots(figsize=dims)
    sns.heatmap(corrmatrix, vmax=.8, square=True)
    plt.show()

There are fairly high correlations across the board, but it's possible that only a few columns are needed to make predictions. Try selecting only the columns relating to questions 6, 7, 43, 45, and 46 for your model, as these are the least correlated. In the following code the colon selects all rows, and the list of numbers (in brackets) selects the columns by position rather than by name. Positions start at 0, so question 6 is at position 5, question 7 at position 6, and so on.

    divorced2 = pd.DataFrame(divorced.iloc[:, [5, 6, 42, 44, 45]])
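
The column positions follow from the question numbers because 'iloc' counts from 0. A short sketch of that mapping, assuming the columns appear in question order starting with question 1:

```python
# Question numbers as they appear in the survey (1-based).
questions = [6, 7, 43, 45, 46]

# iloc counts columns from 0, so question n sits at position n - 1.
positions = [q - 1 for q in questions]
print(positions)  # [5, 6, 42, 44, 45]
```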

### Training the model

Logistic regression is a binary classifier: it places each sample into one of two categories, such as True/False or Positive/Negative. In this case the categories are divorced and not divorced. Logistic regression can also be used as one step in solving multinomial classification problems, and the Scikit-Learn 'LogisticRegression' class we are using can implement multinomial logistic regression. If you'd like to learn more about Logistic Regression, you can find a good overview here.

Before fitting a model it is important to set aside some data (usually about 20-30%) and save it as a test set. This test set will be used to test model accuracy. It cannot be used to train the model. The Scikit-Learn 'train_test_split' function is very useful for splitting datasets into training and test sets. First import the function, then declare your target variable (y) and the variables being used to predict (X) before splitting both into training and test sets.

    from sklearn.model_selection import train_test_split
    # Declare predictor variables.
    X = divorced2
    # Declare target variable.
    y = divorced['Class']
    # Split the data into train and test sets.
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20)
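
Conceptually, a random split shuffles the row indices and cuts them in two. The sketch below shows the idea in plain Python; it is not Scikit-Learn's implementation, and the helper name 'simple_split' and the seed are hypothetical.

```python
import math
import random

def simple_split(n_samples, test_size=0.20, seed=0):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

# The dataset has 170 rows (86 + 84).
train_idx, test_idx = simple_split(170)
print(len(train_idx), len(test_idx))
```

Because the split is random, the coefficients and accuracies you get will differ slightly from run to run. Passing a fixed 'random_state' to 'train_test_split' makes the split repeatable.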

Next import and declare a logistic regression classifier and fit the model to the training data. Fitting the model means finding the right coefficients and intercept for the logistic regression equation. With good coefficients, plugging the values of the independent variables into the equation gives the correct value of the target variable for all or most of the samples.

The regularization parameter (C) which you see below is used to prevent the model from overfitting. Regularization helps the model generalize better to new data not used for training. In this module C is the inverse of the regularization strength, so a lower number means stronger regularization, and the value of C can be lowered (for example, by dividing it by 10) if test set accuracy is much lower than training set accuracy. The default regularization for this Scikit-Learn Logistic Regression module is L2 regularization.

    from sklearn.linear_model import LogisticRegression
    # Declare a logistic regression classifier.
    lr = LogisticRegression(C = 1e5)
    # Fit the model to the training data.
    fit = lr.fit(X_train, Y_train)
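
To illustrate how C trades off the penalty against the fit, the sketch below evaluates an L2-penalized objective of the form 0.5 * sum(w^2) + C * log-loss, which is how the Scikit-Learn documentation describes L2-regularized logistic regression. The data, weights, and helper name 'l2_objective' are made up for illustration; this is not the solver itself.

```python
import math

def l2_objective(weights, intercept, X, y, C):
    """Return (penalty, C * summed log-loss) for labels coded 0 or 1."""
    penalty = 0.5 * sum(w * w for w in weights)
    log_loss = 0.0
    for row, label in zip(X, y):
        z = intercept + sum(w * v for w, v in zip(weights, row))
        sign = 1 if label == 1 else -1
        log_loss += math.log(1.0 + math.exp(-sign * z))
    return penalty, C * log_loss

# Made-up data and coefficients, for illustration only.
X = [[0, 1], [4, 3], [1, 0], [3, 4]]
y = [0, 1, 0, 1]
weights, intercept = [1.0, 1.0], -4.0

for C in (1e5, 1.0, 0.1):
    penalty, data_term = l2_objective(weights, intercept, X, y, C)
    share = penalty / (penalty + data_term)
    print(C, round(share, 4))  # share of the objective due to the penalty
```

With C = 1e5 the penalty is a negligible part of the objective, so the model is almost unregularized. As C shrinks, the penalty dominates and large coefficients become more costly.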

The following code prints the model intercept and coefficients. The coefficients can be useful for assessing the relative importance of each of the features (especially if all of the features are on the same scale).

    print('Intercept')
    print(fit.intercept_)
    print('Coefficients')
    print(fit.coef_)

    Intercept
    [-5.37940024]
    Coefficients
    [[ 0.74640783 10.84130871  0.90409127  0.42400431  0.04745981]]

The coefficient output is in the same order as the columns in the training dataset. I'll create a dataframe to make it easier to match the coefficients with the features.

    weights = pd.DataFrame({'Attributes': list(X_train.columns), 'Coefficients': fit.coef_[0]})
    weights

| | Attributes | Coefficients |
|---|---|---|
| 0 | Atr6 | 0.746408 |
| 1 | Atr7 | 10.841309 |
| 2 | Atr43 | 0.904091 |
| 3 | Atr45 | 0.424004 |
| 4 | Atr46 | 0.047460 |

The most important attribute was number 7: "We are like two strangers who share the same environment at home rather than family."

Makes sense.
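
To see how these numbers turn answers into a prediction, the sketch below plugs the printed intercept and coefficients into the logistic function by hand. The two respondents are hypothetical, and the sketch assumes answers are coded as non-negative integers; your own coefficients will differ because the train/test split is random.

```python
import math

# Intercept and coefficients copied from the printed output (one particular split).
intercept = -5.37940024
coefficients = {
    'Atr6': 0.74640783,
    'Atr7': 10.84130871,
    'Atr43': 0.90409127,
    'Atr45': 0.42400431,
    'Atr46': 0.04745981,
}

def divorce_probability(answers):
    """Logistic function applied to intercept + sum(coefficient * answer)."""
    z = intercept + sum(coefficients[name] * value for name, value in answers.items())
    return 1.0 / (1.0 + math.exp(-z))

# Two hypothetical respondents: all answers 0, then the same with Atr7 raised to 1.
all_zero = {name: 0 for name in coefficients}
strangers = dict(all_zero, Atr7=1)

print(divorce_probability(all_zero))
print(divorce_probability(strangers))
```

A single step on attribute 7 moves the predicted probability from near 0 to near 1, which is what a coefficient this much larger than the others implies.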

### Making predictions

Now use your trained logistic regression model to predict Divorce using both the test and train datasets, and calculate the accuracy of predictions for both.

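    # Predict Divorce for the test and training sets (kept in separate variables)
    pred_y_test = lr.predict(X_test)
    pred_y_train = lr.predict(X_train)
    # score() returns the mean accuracy of the predictions
    print('\n Accuracy')
    print('Test', lr.score(X_test, Y_test))
    print('Train', lr.score(X_train, Y_train))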

    Accuracy
    Test 0.9117647058823529
    Train 0.9044117647058824
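
Accuracy here is simply the fraction of predictions that match the true labels. The printed test value equals 31/34, which is consistent with 31 of 34 test samples being classified correctly. The sketch below reproduces that arithmetic with made-up labels; the function name 'accuracy' is illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 34 test samples, 3 of them misclassified.
y_true = [0] * 17 + [1] * 17
y_pred = [0] * 16 + [1] + [1] * 15 + [0] * 2

print(accuracy(y_true, y_pred))
```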

There are no signs of overfitting, as the test set accuracy is about the same as the training set accuracy. An accuracy of about 90% is not bad (your exact numbers will vary because the train/test split is random), but accuracy might be improved by using PCA (Principal Component Analysis) to deal with correlated variables, rather than throwing them out.
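
As a rough illustration of the PCA idea, the sketch below projects correlated columns onto a few uncorrelated components using NumPy's SVD. The data is randomly generated, and the helper name 'pca_project' is hypothetical. In practice Scikit-Learn's 'sklearn.decomposition.PCA' class would be the more usual choice.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Made-up data: three strongly correlated columns driven by one hidden factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(100, 1))
X = factor @ np.array([[1.0, 0.9, 0.8]]) + 0.1 * rng.normal(size=(100, 3))

components, explained = pca_project(X, n_components=2)
print(components.shape, explained)
```

The resulting components are uncorrelated with each other, so they could be passed to the logistic regression in place of a hand-picked subset of columns.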