I’m in the middle on the Coursera Machine Learning course offered by Andrew Ng at Stanford University. A considerable chunk of the course is dedicated to neural networks, and this was the first time I’d encountered the technique. In the course we implemented a neural network in Matlab to perform handwritten digit recognition, but since I don’t regularly work in Matlab I decided to port the technique over to Python. I soon discovered this was not a trivial task, but I think I learned a lot in the process.
I started by exploring two excellent posts: first was Denny Britz’s post, which served as the framework for this work. The second was this post describing an extremely simple neural network (he got it down to 11 lines!). I also relied heavily on the documentation from the Coursera course I mentioned.
My entire code can be found on Github.
SO LET’S GET INTO THE CODE:
First I imported the necessary packages:
import numpy as np import matplotlib.pyplot as plt import sklearn.datasets import matplotlib import scipy.optimize from sklearn.metrics import accuracy_score
Next, I initialize some of the general variables needed for the neural network:
np.random.seed(0) X, y = sklearn.datasets.make_moons(200, noise = 0.2) m = len(X) Input = X.shape Output = 2 alpha = 1 Lambda = 0.001 Hidden = 3 y_matrix = np.eye(Output)[y]
A neural network needs to be initialized with random values for the weights (thetas). Here, a simple function generates random theta values (although np.random.seed(0) ensures that while testing, these initialize the same each time). The randtheta() function takes two arguments: L_in and L_out. You can think of the theta values as the weights by which the input values of a layer are transformed into output values for the next layer, so L_in and L_out are the dimensions of the layers and thus we ensure that there are the correct number of theta values for each layer in the network.
I call the function and assign the result to theta_init. You’ll notice that everything get’s reshaped and concatenated, such that theta_init is a 1D array of all the theta values. This is because later we’ll pass theta_init to fmin_cg (from scipy.optimize) and this function can only accept the initial values as a 1D array.
def randtheta(L_in,L_out): np.random.seed(0) rand_epsilon = np.sqrt(6) / np.sqrt(L_in+L_out) theta = (np.random.random((L_out, L_in + 1)) *(2*rand_epsilon)) - rand_epsilon return theta theta_init = p.concatenate((randtheta(Input,Hidden).reshape(((Input+1)*Hidden)),randtheta(Hidden,Output).reshape(((Hidden+1)*Output))))
Now that we’ve prepared all of the groundwork, we can get to the nitty gritty of the neural network. The first thing we have to define is our activation function, which serves to transform the inputs of any particular layer in the network to its outputs. Denny Britz’s network used a tanh activation function, but to keep consistent with the Coursera model I opted to go with a sigmoid activation function. There are a variety of choices here, and each choice affects the math involved in the backpropagation step (described later).
def sigmoid(z): return 1.0 / (1.0 + np.exp(-1 * z))
Next, we’ll perform forward propagation. This is the part of the neural network that takes the given weights (at this point, they’re randomly initialized) and the X values, and outputs the probabilities that an X value is each of the two possible outputs (namely, 1 and 0). In a nutshell, forward propagation takes the input layer (a1; the X values defined above along with a bias unit of all 1’s) and matrix-multiplies it by the theta values for layer one. The result of this is passed to the sigmoid() function to generate the values of the first hidden layer (a2; again, a bias unit of all 1’s is concatenated). The process is repeated for each subsequent layer (in our case there is only one more jump which brings us to the output layer; note that we obviously don’t add a bias layer to the output layer a3 since no further calculations will be performed).
The first time we encounter forward propagation is in the predict() function, which takes the current theta values and performs forward propagation to calculate the probabilities, and then outputs the predicted value for y. In order to convert these probabilities into specific calls (i.e. 1 or 0) I used np.argmax(), which returns the index of the maximum value (i.e. the column with the highest priority).
def predict(theta,x): # Unroll the theta values as before theta1 = theta[:(Input+1)*Hidden].reshape((Hidden,(Input+1))) theta2 = theta[(Input+1)*Hidden:].reshape((Output,(Hidden+1))) m = np.float(len(x)) # Forward propagation (also as before!) a1 = np.concatenate((np.ones((1,m)).T,x),axis=1) z2 = a1.dot(theta1.T) a2 = np.concatenate((np.ones((1,m)).T,sigmoid(z2)),axis=1) z3 = a2.dot(theta2.T) a3 = sigmoid(z3) return np.argmax(a3, axis=1)
And that’s it! We have just generated the predicted value y given the two features of X and the parameters theta. But this alone has little predictive value, since the weights in the model were generated randomly. Indeed, at this point the accuracy comes out to 0.5 – essentially a guess:
The predict function is only called once we’ve trained the model, and serves only to output the predicted values in order to assess accuracy or to plot the decision boundary (more on this later). Rather, the whole process revolves around the nn() function below. The nn() function runs forward propagation, calculates the cost given the theta values, runs backpropagation to determine the gradient, and then outputs the cost and gradient for us to use in the optimization function. Let’s go through it step by step:
def nn(theta): # Unroll the theta values from a 1D array theta1 = theta[:(Input+1)*Hidden].reshape((Hidden,(Input+1))) theta2 = theta[(Input+1)*Hidden:].reshape((Output,(Hidden+1))) # Forward propagation a1 = np.concatenate((np.ones((1,m)).T,X),axis=1) z2 = a1.dot(theta1.T) a2 = np.concatenate((np.ones((1,m)).T,sigmoid(z2)),axis=1) z3 = a2.dot(theta2.T) a3 = sigmoid(z3) # Cost function J = np.sum(-y_matrix*np.log(a3) - (1-y_matrix)*(np.log(1-a3)))/m + (Lambda/(2*m)) * (np.sum(theta1[:,1:]**2) + np.sum(theta2[:,1:]**2)) # Backpropagation d3 = (a3 - y_matrix) d2 = (a3 - y_matrix).dot(theta2) * (a2*(1-a2)) D1 = d2[:,1:].T.dot(a1) / m D2 = d3.T.dot(a2) / m if Lambda!=0: # This regularizes the gradients if Lambda > 0 D1[:,1:] += (Lambda/m * theta1[:,1:]) D2[:,1:] += (Lambda/m * theta2[:,1:]) # Roll grads D1 and D2 back into 1D array grads = np.concatenate((D1.reshape(((Input+1)*Hidden)),D2.reshape(((Hidden+1)*Output)))) return J, grads
First we unroll the theta values from the 1D array. Next, we perform forward propagation (just like in the predict() function). Now we can calculate the cost J. This step is crucial, as later when we feed our neural network into the optimization function, we’ll be trying to minimize J.
Now that we have the cost given these particular values of theta, we use backpropagation to find the gradients of the cost function with respect to our parameters (theta1 and theta2). I would need another whole blog post to explain the backpropagation algorithm in detail, so you’ll have to trust me here. I’ve implemented backpropagation to find the sigmoid gradients of our parameters. Finally, these gradients are re-rolled into a 1D array so they’re fit to be handed off to fmin_cg.
Unfortunately, the implementation of fmin_cg in Python will only return one value, unlike in Matlab where we can ask fmincg to return both the cost and the gradient in one step. To overcome this limitation, fmin_cg takes the argument fprime which is a function that returns the gradients of theta at X. So all I did was pseudo-break the nn() function into two parts, like so:
def cost(theta): return nn(theta) def grad(theta): return nn(theta)
Now I can call fmin_cg to minimize cost, given a set of initial theta values, a cost function and a gradient function:
model = scipy.optimize.fmin_cg(cost, x0=theta_init, fprime=grad, full_output=1) theta = model # Stores the optimized theta values for use in predict() print "The cost is %f" %(model) print "The accuracy is %f" %(accuracy_score(y, predict(theta,X)))
And we’re done! This code, as it stands, will a take data set with two features through a three-layer neural network, wherein the hidden layer has three nodes, and return the weights that minimize the cost. Subsequently, feeding these weights to the predict() function returns the predicted values of y, and from this we can find the accuracy.
When I run this model, the cost is 0.132428 and an accuracy of 98%. Not bad.
VISUALIZING THE MODEL
Since the model is based on simple data (only two features) it’s actually pretty easy to visualize the outcome. I’ll go through that code now.
First, we define a function that generates everything we need to plot the decision boundary. This is heavily borrowed from Denny Britz’s post. We have two features, so it’s easy enough to plot the data assigning the features to the x and y axes respectively. The make_moons() function in sklearn.datasets (which generated our data set) makes these fun moon shaped data distributions:
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
To plot the decision boundary over this, we’ll take the minimum and maximum values for both of the features and construct an array of fictional points around it. When we feed these into the predict() function, we’ll get a predicted outcome and we can plot that on a contour plot to visualize:
def plot(pred_func): x_min, x_max = X[:,0].min() - 0.5, X[:,0].max() + 0.5 y_min, y_max = X[:,1].min() - 0.5, X[:,1].max() + 0.5 h = 0.01 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) Z = pred_func(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral) plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.Spectral)
Now to actually draw the plot:
plot(lambda x: predict(theta,x)) plt.title("Decision Boundary for hidden layer size 3")
Here is the output when I run the full code:
Adding more hidden layers will increase the complexity of the model, but of course you run the risk of overfitting:
Increasing the number of layers does improve accuracy, but it is very likely overfitting the data, meaning our model is too specific for the training set and won’t generalize well on other data. More is not always better.
A WORD ABOUT REGULARIZATION
For those of you paying close attention, you may have noticed one of the initial parameters was Lambda and that a regularization term was added to the cost and the gradient. I won’t go into the specifics of regularization here, except to say that that adjusting Lambda adjusts the penalty for complexity – regularization attempts to correct for overfitting. A very large Lambda value results in high variance: the model will underfit the data. Too small, and we get high bias: the model will overfit the data. In my testing, if the value of Lambda was too small (or 0), the model needed too many iterations for cost to converge and I got an error. (This could be fixed by reducing the number of nodes in the hidden layer to 2). Too large a value for Lambda and I either got a straight line decision boundary, or the model didn’t fit at all. For best results, an intermediate Lambda value should be set.
Ideally, we’d be dividing our data randomly into training, cross-validation and testing sets, and using the cross-validation set to optimize Lambda, always checking our work against the testing set. I clearly have not done this here, although it is on my to-do list.
I was pretty pleased in the end with what this model achieves. Although I haven’t tested it with other data sets, so expect a follow up post addressing that issue soon. I’m also quite new at this, so if you have suggestions or improvements, or if you notice glaring errors, please write in the comments or send me an email.