Learning automatically from data – logistic regression with L2 regularization in Python

Logistic regression

Logistic regression is used for binary classification problems – problems where you have some examples that are “on” and other examples that are “off.” As input you get a training set, which has some examples of each class along with a label saying whether each example is “on” or “off”. The goal is to learn a model from the training data so that you can predict the label of new examples that you haven’t seen before and whose labels you don’t know.
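Concretely, a logistic regression model scores an example with a weighted sum of its features and squashes that score through the sigmoid function to get a probability. Here is a minimal sketch with made-up weights and a made-up two-feature example (nothing here is learned from data – it just shows the prediction mechanics):

```python
import numpy as np

def sigmoid(z):
    # Squashes a real-valued score into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example: a building described by two features, with
# illustrative (not learned) weights.
betas = np.array([0.8, -1.2])     # weights, one per feature
x = np.array([0.5, 1.0])          # feature vector for one building
p_on = sigmoid(np.dot(betas, x))  # estimated P(label = "on" | x)
print(p_on)                       # a probability strictly between 0 and 1
```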

For example, suppose you have data describing a bunch of buildings and earthquakes (for example, the year the building was constructed, the type of material used, the magnitude of the earthquake, etc.), and you know whether each building collapsed (“on”) or not (“off”) in each past earthquake. Using this data, you’d like to predict whether a given building will collapse in a hypothetical future earthquake.

One of the first models that would be worth trying is logistic regression.

Code it up

I wasn’t working on this exact problem, but I was working on something close. Being one to practice what I preach, I started looking for a dead-simple Python logistic regression class. The only requirement was that it support L2 regularization (more on this later). I also share code with a lot of other people on many platforms, so I wanted it to have as few dependencies on external libraries as possible.

I couldn’t find exactly what I wanted, so I decided to take a stroll down memory lane and implement it myself. I’ve written it in C++ and Matlab before, but never in Python.

I won’t do the derivation here, but there are plenty of good explanations out there if you’re not afraid of a little calculus. Just do some Googling for “logistic regression derivation.” The big idea is to write down the probability of the data given some setting of internal parameters, then take the derivative, which tells you how to change the internal parameters to make the data more likely. Got it? Good.
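If you want to convince yourself that the derivative is what you think it is without doing the algebra, a finite-difference check works nicely. Below is a small sketch (not part of the original code) for a single example with a label y in {-1, +1}: the analytic gradient of log P(y | x, betas) is y * x * sigmoid(-y * betas·x), and we compare it against numerical differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(betas, x, y):
    # log P(y | x, betas) for one example, with y in {-1, +1}
    return np.log(sigmoid(y * np.dot(betas, x)))

def grad_log_lik(betas, x, y):
    # Analytic gradient: y * x * sigmoid(-y * betas.x)
    return y * x * sigmoid(-y * np.dot(betas, x))

rng = np.random.default_rng(0)
betas, x, y = rng.normal(size=3), rng.normal(size=3), 1.0

# Central finite difference in each coordinate.
eps = 1e-6
num = np.array([(log_lik(betas + eps * e, x, y) -
                 log_lik(betas - eps * e, x, y)) / (2 * eps)
                for e in np.eye(3)])
print(np.max(np.abs(num - grad_log_lik(betas, x, y))))  # should be tiny
```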

For those of you out there who know logistic regression inside and out, take a look at how short the train() method is. I really like how easy it is to do this in Python.

Regularization

I caught a little indirect flak during March Madness season for talking about how I regularized the latent vectors in my matrix-factorization model of team offensive and defensive strengths when predicting outcomes in NCAA basketball. Apparently people thought I was talking nonsense – crazy, right?

But seriously, guys – regularization is a good idea.

Let me drive the point home. Here are the results of running the code (attached at the bottom).

Look at the top row.

On the left side you have the training set. There are 25 examples laid out along the x-axis, and the y-axis tells you whether the example is “on” (1) or “off” (0). For each of these examples there is a vector describing its attributes that I’m not showing. After training the model, I ask it to ignore the known training set labels and to estimate the probability that each label is “on” based only on the examples’ description vectors and what the model has learned (hopefully things like stronger earthquakes and older buildings increase the probability of collapse). The probabilities are shown by the red X’s. In the top left, the red X’s are right on top of the blue dots, so the model is very sure about the labels of the examples, and it’s always correct.

Now on the right side, we have some new examples that the model hasn’t seen before. This is called the test set. This is essentially the same as the left side, but the model knows nothing about the test set labels (yellow dots). What you see is that it still does a decent job of predicting the labels, but there are some troubling cases where it is very confident and very wrong. This is known as overfitting.

This is where regularization comes in. As you go down the rows, there is increasingly strong L2 regularization – or equivalently, increasing pressure on the internal parameters to be zero. This has the effect of reducing the model’s confidence. Just because it can perfectly reconstruct the training set doesn’t mean it has everything figured out. You can imagine that if you were relying on this model to make important decisions, you’d want at least a bit of regularization in there.
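You can see the mechanism directly in the weights: the L2 penalty (alpha / 2) * sum(beta_k**2) pulls the fitted parameters toward zero, and the pull gets stronger as alpha grows. Here is a toy sketch (not the post’s code) with a single perfectly “separable” example, where without a penalty the optimal weight would run off to infinity:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_neg_lik(beta, alpha):
    # One "on" example with feature value 1: without the penalty term,
    # the likelihood keeps improving as beta grows without bound.
    return -np.log(sigmoid(beta * 1.0)) + (alpha / 2.0) * beta ** 2

# Fit the single weight for a few penalty strengths.
betas = {alpha: minimize_scalar(penalized_neg_lik, args=(alpha,)).x
         for alpha in (0.01, 0.1, 1.0)}
print(betas)  # the fitted weight shrinks as alpha grows
```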

And here’s the code. It looks long, but most of it is there to generate the data and plot the results. The bulk of the work is done in the train() method, which is just three (dense) lines. It requires numpy, scipy and pylab.

* For full disclosure, I should admit that I generated my random data in a way that makes it susceptible to overfitting, which may make logistic-regression-without-regularization look worse than it is.

Python code

from scipy.optimize import fmin_bfgs
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SyntheticClassifierData():

    def __init__(self, N, d):
        """ Create N instances of d-dimensional input vectors and a 1D
        class label (-1 or 1). """
        means = .05 * np.random.randn(2, d)

        self.X_train = np.zeros((N, d))
        self.Y_train = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_train[i, :] = np.random.random(d) + means[y, :]
            self.Y_train[i] = 2.0 * y - 1

        self.X_test = np.zeros((N, d))
        self.Y_test = np.zeros(N)
        for i in range(N):
            if np.random.random() > .5:
                y = 1
            else:
                y = 0
            self.X_test[i, :] = np.random.random(d) + means[y, :]
            self.Y_test[i] = 2.0 * y - 1

class LogisticRegression():
    """ A simple logistic regression model with L2 regularization (zero-mean
    Gaussian prior on the parameters). """

    def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
                 alpha=.1, synthetic=False):
        # Set L2 regularization strength
        self.alpha = alpha

        # Set the data.
        self.set_data(x_train, y_train, x_test, y_test)

        # Initialize parameters to zero, for lack of a better choice.
        self.betas = np.zeros(self.x_train.shape[1])

    def negative_lik(self, betas):
        return -1 * self.lik(betas)

    def lik(self, betas):
        """ Likelihood of the data under the current settings of parameters. """
        # Data likelihood
        l = 0
        for i in range(self.n):
            l += np.log(sigmoid(self.y_train[i] *
                                np.dot(betas, self.x_train[i, :])))

        # Prior likelihood
        for k in range(1, self.x_train.shape[1]):
            l -= (self.alpha / 2.0) * betas[k]**2

        return l

    def train(self):
        """ Define the gradient and hand it off to a scipy gradient-based
        optimizer. """
        # Define the derivative of the likelihood with respect to beta_k.
        # Need to flip signs because we will be minimizing the negative
        # likelihood; the k > 0 factor leaves the intercept unregularized,
        # matching lik() above.
        dB_k = lambda B, k: (k > 0) * self.alpha * B[k] - np.sum([
            self.y_train[i] * self.x_train[i, k] *
            sigmoid(-self.y_train[i] *
                    np.dot(B, self.x_train[i, :]))
            for i in range(self.n)])

        # The full gradient is just an array of componentwise derivatives
        dB = lambda B: np.array([dB_k(B, k)
                                 for k in range(self.x_train.shape[1])])

        # Optimize
        self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

    def set_data(self, x_train, y_train, x_test, y_test):
        """ Take data that's already been generated. """
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.n = y_train.shape[0]

    def training_reconstruction(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i, :]))
        return p_y1

    def test_predictions(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i, :]))
        return p_y1

    def plot_training_reconstruction(self):
        plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
        plot(np.arange(self.n), self.training_reconstruction(), 'rx')
        ylim([-.1, 1.1])

    def plot_test_predictions(self):
        plot(np.arange(self.n), .5 + .5 * self.y_test, 'yo')
        plot(np.arange(self.n), self.test_predictions(), 'rx')
        ylim([-.1, 1.1])

if __name__ == "__main__":
    from pylab import *

    # Create a 20-dimensional data set with 25 points -- this will be
    # susceptible to overfitting.
    data = SyntheticClassifierData(25, 20)

    # Run for a variety of regularization strengths
    alphas = [0, .001, .01, .1]
    for j, a in enumerate(alphas):
        # Create a new learner, but use the same data for each run
        lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
                                x_test=data.X_test, y_test=data.Y_test,
                                alpha=a)

        print("Initial likelihood:")
        print(lr.lik(lr.betas))

        # Train the model
        lr.train()

        # Display execution info
        print("Final betas:")
        print(lr.betas)
        print("Final likelihood:")
        print(lr.lik(lr.betas))

        # Plot the results
        subplot(len(alphas), 2, 2 * j + 1)
        lr.plot_training_reconstruction()
        ylabel("Alpha=%s" % a)
        if j == 0:
            title("Training set reconstructions")

        subplot(len(alphas), 2, 2 * j + 2)
        lr.plot_test_predictions()
        if j == 0:
            title("Test set predictions")

    show()
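As an aside (not part of the original post): the per-example Python loops above are easy to read but slow for larger data sets. The same L2-regularized objective and gradient can be written in vectorized NumPy and handed to scipy.optimize.minimize instead. A sketch, with a tiny synthetic sanity check on two well-separated clusters:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic(X, y, alpha=0.1):
    """Fit weights for labels y in {-1, +1} with penalty (alpha/2)*||betas||^2."""
    def neg_lik(betas):
        margins = y * (X @ betas)
        return -np.sum(np.log(sigmoid(margins))) + (alpha / 2.0) * betas @ betas

    def grad(betas):
        margins = y * (X @ betas)
        return -X.T @ (y * sigmoid(-margins)) + alpha * betas

    result = minimize(neg_lik, np.zeros(X.shape[1]), jac=grad, method="BFGS")
    return result.x

# Tiny synthetic check: two well-separated 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, .3, size=(20, 2)),
               rng.normal(1, .3, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
betas = fit_l2_logistic(X, y)
accuracy = np.mean(np.sign(X @ betas) == y)
print(accuracy)
```

Note that this sketch penalizes every component (there is no intercept column here), whereas the class above deliberately leaves beta_0 unregularized.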