Recognising sign language signs from glove sensor data

I downloaded the Australian Sign Language dataset from the UCI Knowledge Discovery in Databases Archive. The data consists of a sample of Australian Sign Language signs performed by volunteers. There are 95 unique signs, each recorded 27 times on different days. The data was recorded using two Fifth Dimension Technologies (5DT) gloves (one for each hand) and two Ascension Flock-of-Birds magnetic position trackers. Together, this produced 22 channels of data, 11 for each hand: x, y and z position, roll, pitch and yaw, and a bend measurement for each finger.

The data was supplied as 2565 text files, one for each recorded sign. Each file contains time series data across the 22 channels. The data was recorded at approximately 100 Hz, and the average number of frames per recording is 57 (with a range from 45 to 136).

To start, the data was loaded into an IPython notebook using the os.walk() function to iterate through the directory:

import numpy as np
import os
import re

Next, I iterated through the directory structure and imported each file. The naming format was “Sign-#.tsd”, for example: “Alive-1.tsd”. Using the re package, I could do a simple regex search to pull out the sign name (the y values of the data set), which I stored in a numpy array.

Because different signs take varying amounts of time to perform, each file was a different length. Since most models require that all the data is the same length, I initialized an array of NaN values equal to the length of the longest recording and used np.put() to replace the beginning of the array with the current file’s data.

x = np.empty(0)
y = np.empty(0)
data = np.empty(0)

for root, dirs, files in os.walk('tctodd'):
    for fn in files:
        if fn.endswith("tsd"):
            current_file = np.full(2992, np.nan)  # longest recording (136 frames) * 22 channels
            vals = np.loadtxt(os.path.join(root, fn), delimiter='\t').ravel(order='F')
            np.put(current_file, range(0, len(vals)), vals)  # fill the front of the NaN array
            data = np.append(data, current_file)
            y = np.append(y, re.search('(.+?)-[0-9]', fn).group(1))  # sign name from the filename
x = data.reshape((len(data) // 2992, 2992))

I saved the data using numpy’s np.save() function:

np.save('x',x)
np.save('y',y)

Each row has length 2992, which corresponds to the length of the longest recording (136 frames) multiplied by the number of variables (22). This is a good start; however, there is a lot of missing data present. I used numpy's linear interpolator to fill in the gaps, placing each recording's real values at randomly chosen positions across the full 136-frame window and interpolating between them:

import random

def npinterpolate(data):

    def nanfinder(x):
        # Returns a boolean mask of NaNs and a helper that converts a mask to indices
        return np.isnan(x), lambda z: z.nonzero()[0]

    x_interp = []
    for row in range(data.shape[0]):
        holder = []
        # Number of real frames in this recording (non-NaN values / 22 channels)
        dim = np.count_nonzero(~np.isnan(data[row])) // 22
        for i in range(1, 23):
            scaffold = np.array([np.nan] * 136)
            current_var = data[row, i * dim - dim:i * dim]
            # Scatter the real values at randomly chosen (sorted) positions in the scaffold
            randpts = np.sort(random.sample(range(136), dim))
            scaffold[randpts] = current_var[:]
            # Linearly interpolate the remaining NaN positions
            nans, x = nanfinder(scaffold)
            scaffold[nans] = np.interp(x(nans), x(~nans), scaffold[~nans])
            holder.extend(scaffold.tolist())
        x_interp.extend(holder)
    return np.array(x_interp).reshape(data.shape)

x_interp = npinterpolate(x)
np.save('x_interp',x_interp)

This code generates a scaffold of NaN values, randomly chooses as many indices from the scaffold as there are frames in the sign being interpolated, fills the real values into the scaffold at those random indices, and then linearly interpolates the missing values. Finally, the interpolated data is saved using np.save().

Now I have a single array of dimensions 2565 x 2992, where each row represents a different sign, and I'm ready to begin training my model. But before any modelling can take place, I split my data into training and testing sets using train_test_split():

from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)

Next, I instantiate a Sci-Kit Learn pipeline. Pipelines are a way of combining multiple steps into a single workflow so that the entire process can be changed or optimized efficiently.

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

pipeline = make_pipeline(StandardScaler(), PCA(), LinearSVC())

The pipeline specifies three steps: scaling using StandardScaler(), principal components analysis (PCA) using the PCA() function and finally a linear support vector classifier (LinearSVC()).

There are almost 3000 features for each sign in my data set, which might be overkill for model training, so to reduce the dimensionality of the data I used principal components analysis (PCA). The first step is to centre and scale the data, which is done with the StandardScaler() function in Sci-Kit Learn. Once the PCA algorithm has reduced the dimensionality of the data to just those components which capture the most variance, the data is used to train a linear support vector classification (SVC) model.

I'm also going to use GridSearchCV(), which searches over a grid of parameter values to find the combination that performs best on the training data. GridSearchCV also performs cross-validation (in this case we'll use the default, which is 3-fold cross-validation). The param_grid dictionary specifies which parameters will be searched. The naming convention is the pipeline step name (the lower-case class name) followed by two underscores and then the parameter name. As you can see below, we'll test 10 n_components values between 50 and 95 for PCA and 10 C values between 0.01 and 0.1 for the Linear SVC.

from sklearn.grid_search import GridSearchCV
param_grid = {'pca__n_components': np.arange(50,100,5),
              'linearsvc__C': 10 ** np.linspace(-2,-1,10)}

gridsearch = GridSearchCV(pipeline,param_grid=param_grid,verbose=3)

Now that everything is set up, we can run the model by calling the fit method on the gridsearch object. Sit back, since this code will fit 300 models and automatically determine the optimal number of components and the optimal C value – it could take a while.

gridsearch.fit(x_train, y_train)

When the model is done training, you can print the scores the model obtains on the training and testing sets quite easily by calling the score method on the gridsearch object. (Note: this code is written for Python 3, where print is a function. Change accordingly for Python 2).

print('The score on the training set is:',gridsearch.score(x_train,y_train).round(3))
print('The score on the testing set is:',gridsearch.score(x_test,y_test).round(3))

When I ran this code, I got a training score of 0.994 and a testing score of 0.875, which might suggest that my model is overfitting the training data a bit. I could stand to do better by searching a wider range of values for C, the penalty parameter of the error term. Additionally, from what I read here, SVCs are quite good at handling overfitting even when the number of features is greater than the number of observations, so it may be wiser to skip the PCA step altogether (more on this later!).

We can also print the best parameters as determined by the grid search:

for param, value in sorted(gridsearch.best_params_.items()):
    print('The', param, 'parameter value is:', np.round(value, 4))

The final piece is to plot the n_components and C values to visualize how they affect the score:

import matplotlib.pyplot as plt
%matplotlib notebook

pca_dim = len(param_grid['pca__n_components'])
c_dim = len(param_grid['linearsvc__C'])
scores = [x[1] for x in gridsearch.grid_scores_]
scores = np.array(scores).reshape(c_dim, pca_dim)

plt.matshow(scores, cmap='Spectral')
plt.xlabel('PCA: Dimensionality')
plt.ylabel('Linear SVM: C parameter')
plt.colorbar()
plt.xticks(np.arange(pca_dim), param_grid['pca__n_components'])
plt.yticks(np.arange(c_dim), param_grid['linearsvc__C'].round(4))
plt.suptitle('PCA Dimensionality vs SVC C Parameter in Terms of Model Score',fontsize=16)

[Figure: PCA dimensionality vs SVC C parameter in terms of model score]

As you can see, the score varies considerably across the hyperparameter grid. However, I'm still not satisfied with the performance on the testing data, given that the model appears to have overfit the training data. As I mentioned above, let's try leaving out the PCA step and instead performing a grid search over a wider range of C parameters using just the Linear SVC.

The only code that changes is the instantiation of the GridSearchCV() function:

param_grid = {'C': 10 ** np.linspace(-3,3,100)}
# Note: no step-name prefix is needed here, since we're only optimizing parameters within a single estimator
gridsearch = GridSearchCV(LinearSVC(),param_grid,verbose=3)
gridsearch.fit(x_train,y_train)

With these settings, I again get a high score on the training data (0.999), while the test set scores a slightly better 0.897. The plot of model score vs the C parameter (and the code to make it) is below. The model does best with a C parameter of around 0.1; increasing C beyond that does slightly worse, with the score staying roughly constant all the way up to 1000. The overfitting does not appear to have been solved, though.

scores = np.array([x[1] for x in gridsearch.grid_scores_])
values = np.log10(param_grid['C'])  # plot C on a log scale

plt.plot(values,scores)
plt.xlabel('log10(C)')
plt.ylabel('Score')
plt.suptitle('Score vs C parameter: LinearSVC()',fontsize=16)

[Figure: score vs C parameter for LinearSVC()]

As a final stab before moving on, I'm going to try the regular SVC() function, which differs chiefly in the ability to define a kernel function. The options are linear, polynomial, sigmoid or RBF. I stuck with the default, RBF, and used GridSearchCV to search for the optimal combination of the two parameters, C and gamma:

from sklearn.svm import SVC

param_grid = {'C': 10. ** np.linspace(0,4,15),
              'gamma': 10 ** np.linspace(-5,-1.3,15)}

gridsearch = GridSearchCV(SVC(),param_grid,verbose=3,cv=3)
gridsearch.fit(x_train,y_train)

Here, the training set score is 0.99 and the testing set score has improved a little to 0.91. The optimal value for gamma was ~2.73e-4 and for C was ~373. Clearly, there is some value in this approach. The plot of C vs gamma is below:

[Figure: SVC model score across C and gamma values]
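The code that produced this heatmap wasn't shown above; a minimal sketch, mirroring the earlier matshow() code, might look like the following. The reshape assumes grid_scores_ lists the parameter combinations with gamma varying fastest (parameters sorted by name), so double-check that ordering before reusing it:

c_vals = param_grid['C']
gamma_vals = param_grid['gamma']

scores = np.array([s[1] for s in gridsearch.grid_scores_]).reshape(len(c_vals), len(gamma_vals))

plt.matshow(scores, cmap='Spectral')
plt.xlabel('SVC: gamma parameter')
plt.ylabel('SVC: C parameter')
plt.colorbar()
plt.xticks(np.arange(len(gamma_vals)), gamma_vals.round(5), rotation=90)
plt.yticks(np.arange(len(c_vals)), c_vals.round(1))
plt.suptitle('SVC C Parameter vs Gamma in Terms of Model Score', fontsize=16)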


It’s apparent from the plot that there is a trade-off between the gamma and C values. I’m not sure if there’s improvement to be had without altering the pre-processing steps.

All things said and done, however, I'm happy with a score of 0.91 on the testing data, and I think the model performed pretty well. Could things improve? I'm sure. In my research, I stumbled across Triangular Global Alignment kernels, and I think the technique might be a great fit for this kind of classification. It'll take a bit of work though, because that implementation doesn't exist in a library yet, although it can be manually added to $PYTHONPATH. Sci-Kit Learn's SVC allows for custom kernels by passing a callable to the kernel parameter (or a precomputed kernel matrix can be supplied instead!)
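For illustration, here is a minimal sketch of how a custom kernel could be plugged into SVC, either as a callable or as a precomputed Gram matrix. The gak_kernel() function below is a hypothetical stand-in for a Global Alignment kernel; here it just computes a linear kernel so the sketch runs end to end:

from sklearn.svm import SVC

def gak_kernel(A, B):
    # Hypothetical placeholder for a (Triangular) Global Alignment kernel;
    # a real implementation would compare the underlying time series.
    return A.dot(B.T)

# Option 1: pass the callable directly
svc = SVC(kernel=gak_kernel)
svc.fit(x_train, y_train)
print(svc.score(x_test, y_test))

# Option 2: precompute the Gram matrices and use kernel='precomputed'
K_train = gak_kernel(x_train, x_train)   # shape (n_train, n_train)
K_test = gak_kernel(x_test, x_train)     # shape (n_test, n_train)
svc = SVC(kernel='precomputed')
svc.fit(K_train, y_train)
print(svc.score(K_test, y_test))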

Finally, I should acknowledge Mohammed Waleed Kadous, who donated the data to the UCI KDD Archive.


Recognizing Handwritten Digits Using a Neural Network in Python

My last post described a neural network written entirely in Python, which performed reasonably well on a dummy data set. In that script the sklearn.datasets.make_moons() function generated random points with two features each, and the neural network managed to classify those points into one of two possible y values. The decision boundary was easy to visualize since there were only two features, and it was clear that the neural network was dividing the data set well.

While this was great, it was not a particularly relevant use case for neural networks. So I grabbed the MNIST data set of handwritten digits and tried my hand at character recognition using a neural network. The MNIST data set is quite well known and very well studied (there is also a Kaggle competition). I only used a subset of the data, borrowed from the Coursera Machine Learning course I'm doing.

Essentially, the data set consists of 5000 rows, each row corresponding to a handwritten character. In each row there are 400 values which make up a 20 x 20 pixel image, expressed as grayscale values. Below is a random selection of 100 of these digits:

[Figure: a random selection of 100 handwritten digits from the data set]

Adapting my previous code to accept this data set was relatively easy, because I had the good sense to parameterize most of the model variables. I did have to change a few things though. First, I wanted to split the data into a training and testing set, so I used sklearn.cross_validation.train_test_split() to carve out 25% of my data for testing the model at the end:

from sklearn.cross_validation import train_test_split
# Split the data into testing and training sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=0)

I made some subtle changes to the nn() function to account for the new variable names (for example I no longer have just m, but rather mtrain and mtest). I also added the maxiter argument to fmin_cg() since we’re now training on 3750 examples with 400 features each, which might take a while on my lowly 2009 Macbook.
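For reference, the change amounts to something like this (a sketch reusing the cost, grad and theta_init names, and the scipy.optimize import, from my previous post; the actual version with the renamed variables is on Github):

# Cap the number of optimizer iterations with maxiter
model = scipy.optimize.fmin_cg(cost, x0=theta_init, fprime=grad,
                               maxiter=200, full_output=1)
theta = model[0]  # optimized weights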

Finally, I output the accuracy on the training set and additionally ran the predict() function on the testing set to see how the model fared on unseen data.
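In sketch form (variable names assumed; the full code is on Github):

# Accuracy on the training set and on the held-out testing set
print("Training accuracy: %f" % accuracy_score(ytrain, predict(theta, Xtrain)))
print("Testing accuracy: %f" % accuracy_score(ytest, predict(theta, Xtest)))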

The very last thing I did was add in the printer() function which randomly grabs one row from the whole data set, calls predict() on that data and draws the image with the predicted digit in the title:

[Figure: a randomly selected digit drawn with its predicted value in the title]
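A rough sketch of what printer() might look like is below; the reshape and orientation of the 20 x 20 images are assumptions (the real version is in the Github code), and it assumes numpy as np, matplotlib.pyplot as plt and the predict() function are already available:

def printer(X, theta):
    # Pick one random example, predict its class, and draw the image
    i = np.random.randint(X.shape[0])
    pred = predict(theta, X[i].reshape(1, -1))[0]
    plt.imshow(X[i].reshape(20, 20).T, cmap='gray')  # image orientation is a guess
    plt.title('Predicted digit: %d' % pred)
    plt.show()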

I spent most of my time playing with the regularization parameter, the number of nodes in the hidden layer and the maxiter argument. The data set is far more complex than the make_moons() model in my original post, so limiting the complexity through regularization is valuable here. I settled on a lambda value of 3, however this could probably use some optimization. Lower values increase the amount of time the model takes to train, but I didn't find there was much benefit on the testing set – all I got was a very high training set accuracy. It's clear that with low lambda values, the model is overfitting the training set, just as one would expect. Also as expected, if I increased lambda too much the training and testing accuracy dropped considerably.
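A hedged sketch of that comparison, assuming nn() reads the global Lambda value at call time (as in my previous post) and using a few illustrative Lambda values:

# Compare training/testing accuracy for a few Lambda values
for Lambda in [0.3, 1, 3, 10]:
    model = scipy.optimize.fmin_cg(cost, x0=theta_init, fprime=grad,
                                   maxiter=200, full_output=1)
    theta = model[0]
    print("Lambda %.1f: train %.3f, test %.3f" % (
        Lambda,
        accuracy_score(ytrain, predict(theta, Xtrain)),
        accuracy_score(ytest, predict(theta, Xtest))))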

Increasing the number of nodes in the hidden layer significantly increased the processing time, but I didn’t spend enough energy optimizing this – I just stuck with 10 nodes which was recommended in the Coursera notes.

The higher the maxiter value is, the longer the model takes to train. I kept this low (less than 200) for most of my testing because I didn’t want to wait around for my project to run. I’m going to try loading this up onto the supercomputer we have at work (I’m really excited for that!) and I’ll see what kind of results I can get.

I got a training set accuracy of 96% and a testing set accuracy of 92% using a lambda value of 3 and 200 iterations.

You can get my full code on Github.

That’s all for now – I mainly wanted to share this code. I was pleasantly surprised at just how easy it was to get this running. Have a look, and comment if you have questions or suggestions!

In my next post I’d like to transfer the neural network to a Python library (such as PyBrain) and compare performance. Look out for that post!

Implementing an Artificial Neural Network in Python

I'm in the middle of the Coursera Machine Learning course offered by Andrew Ng at Stanford University. A considerable chunk of the course is dedicated to neural networks, and this was the first time I'd encountered the technique. In the course we implemented a neural network in Matlab to perform handwritten digit recognition, but since I don't regularly work in Matlab I decided to port the technique over to Python. I soon discovered this was not a trivial task, but I think I learned a lot in the process.

I started by exploring two excellent posts: first was Denny Britz’s post, which served as the framework for this work. The second was this post describing an extremely simple neural network (he got it down to 11 lines!). I also relied heavily on the documentation from the Coursera course I mentioned.

My entire code can be found on Github.

SO LET’S GET INTO THE CODE:

First I imported the necessary packages:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import matplotlib
import scipy.optimize
from sklearn.metrics import accuracy_score

Next, I initialize some of the general variables needed for the neural network:

np.random.seed(0)
X, y = sklearn.datasets.make_moons(200, noise = 0.2)  # 200 points, 2 features, 2 classes
m = len(X)           # number of training examples
Input = X.shape[1]   # input layer size (2 features)
Output = 2           # output layer size (2 classes)
alpha = 1            # not used in the code shown below
Lambda = 0.001       # regularization strength
Hidden = 3           # number of nodes in the hidden layer
y_matrix = np.eye(Output)[y]  # one-hot encoding of y

A neural network needs to be initialized with random values for the weights (thetas). Here, a simple function generates random theta values (although np.random.seed(0) ensures that while testing, these initialize the same each time). The randtheta() function takes two arguments: L_in and L_out. You can think of the theta values as the weights by which the input values of a layer are transformed into output values for the next layer, so L_in and L_out are the dimensions of the layers and thus we ensure that there are the correct number of theta values for each layer in the network.

I call the function and assign the result to theta_init. You'll notice that everything gets reshaped and concatenated, such that theta_init is a 1D array of all the theta values. This is because later we'll pass theta_init to fmin_cg (from scipy.optimize) and this function can only accept the initial values as a 1D array.

def randtheta(L_in,L_out):
 np.random.seed(0)
 rand_epsilon = np.sqrt(6) / np.sqrt(L_in+L_out)
 theta = (np.random.random((L_out, L_in + 1)) *(2*rand_epsilon)) - rand_epsilon
 return theta

theta_init = np.concatenate((randtheta(Input,Hidden).reshape(((Input+1)*Hidden)),
                             randtheta(Hidden,Output).reshape(((Hidden+1)*Output))))

Now that we’ve prepared all of the groundwork, we can get to the nitty gritty of the neural network. The first thing we have to define is our activation function, which serves to transform the inputs of any particular layer in the network to its outputs. Denny Britz’s network used a tanh activation function, but to keep consistent with the Coursera model I opted to go with a sigmoid activation function. There are a variety of choices here, and each choice affects the math involved in the backpropagation step (described later).

def sigmoid(z):
 return 1.0 / (1.0 + np.exp(-1 * z))

Next, we’ll perform forward propagation. This is the part of the neural network that takes the given weights (at this point, they’re randomly initialized) and the X values, and outputs the probabilities that an X value is each of the two possible outputs (namely, 1 and 0). In a nutshell, forward propagation takes the input layer (a1; the X values defined above along with a bias unit of all 1’s) and matrix-multiplies it by the theta values for layer one. The result of this is passed to the sigmoid() function to generate the values of the first hidden layer (a2; again, a bias unit of all 1’s is concatenated). The process is repeated for each subsequent layer (in our case there is only one more jump which brings us to the output layer; note that we obviously don’t add a bias layer to the output layer a3 since no further calculations will be performed).

The first time we encounter forward propagation is in the predict() function, which takes the current theta values and performs forward propagation to calculate the probabilities, and then outputs the predicted value for y. In order to convert these probabilities into specific calls (i.e. 1 or 0) I used np.argmax(), which returns the index of the maximum value (i.e. the column with the highest probability).

def predict(theta,x):
 # Unroll the theta values as before
 theta1 = theta[:(Input+1)*Hidden].reshape((Hidden,(Input+1)))
 theta2 = theta[(Input+1)*Hidden:].reshape((Output,(Hidden+1)))
 m = len(x)  # number of examples

 # Forward propagation (also as before!)
 a1 = np.concatenate((np.ones((1,m)).T,x),axis=1)
 z2 = a1.dot(theta1.T)
 a2 = np.concatenate((np.ones((1,m)).T,sigmoid(z2)),axis=1)
 z3 = a2.dot(theta2.T)
 a3 = sigmoid(z3)

 return np.argmax(a3, axis=1)

And that's it! We have just generated the predicted value y given the two features of X and the parameters theta. But this alone has little predictive value, since the weights in the model were generated randomly. Indeed, at this point the accuracy comes out to 0.5 – essentially a guess:

accuracy_score(y, predict(theta_init,X))

The predict function is only called once we’ve trained the model, and serves only to output the predicted values in order to assess accuracy or to plot the decision boundary (more on this later). Rather, the whole process revolves around the nn() function below. The nn() function runs forward propagation, calculates the cost given the theta values, runs backpropagation to determine the gradient, and then outputs the cost and gradient for us to use in the optimization function. Let’s go through it step by step:

def nn(theta):

 # Unroll the theta values from a 1D array
 theta1 = theta[:(Input+1)*Hidden].reshape((Hidden,(Input+1)))
 theta2 = theta[(Input+1)*Hidden:].reshape((Output,(Hidden+1)))
 
 # Forward propagation
 a1 = np.concatenate((np.ones((1,m)).T,X),axis=1)
 z2 = a1.dot(theta1.T)
 a2 = np.concatenate((np.ones((1,m)).T,sigmoid(z2)),axis=1)
 z3 = a2.dot(theta2.T)
 a3 = sigmoid(z3)

 # Cost function
 J = np.sum(-y_matrix*np.log(a3) - (1-y_matrix)*(np.log(1-a3)))/m + (Lambda/(2*m)) * (np.sum(theta1[:,1:]**2) + np.sum(theta2[:,1:]**2))

 # Backpropagation
 d3 = (a3 - y_matrix)
 d2 = (a3 - y_matrix).dot(theta2) * (a2*(1-a2))
 D1 = d2[:,1:].T.dot(a1) / m
 D2 = d3.T.dot(a2) / m
 if Lambda!=0: # This regularizes the gradients if Lambda > 0
  D1[:,1:] += (Lambda/m * theta1[:,1:])
  D2[:,1:] += (Lambda/m * theta2[:,1:])

 # Roll grads D1 and D2 back into 1D array
 grads = np.concatenate((D1.reshape(((Input+1)*Hidden)),D2.reshape(((Hidden+1)*Output))))

 return J, grads

First we unroll the theta values from the 1D array. Next, we perform forward propagation (just like in the predict() function). Now we can calculate the cost J. This step is crucial, as later when we feed our neural network into the optimization function, we’ll be trying to minimize J.

Now that we have the cost given these particular values of theta, we use backpropagation to find the gradients of the cost function with respect to our parameters (theta1 and theta2). I would need another whole blog post to explain the backpropagation algorithm in detail, so you’ll have to trust me here. I’ve implemented backpropagation to find the sigmoid gradients of our parameters. Finally, these gradients are re-rolled into a 1D array so they’re fit to be handed off to fmin_cg.

Unfortunately, scipy's fmin_cg expects its objective function to return only the cost, unlike Matlab's fmincg, which we can ask to return both the cost and the gradient in one step. To overcome this limitation, fmin_cg takes the argument fprime, a function that returns the gradient with respect to theta. So all I did was pseudo-break the nn() function into two parts, like so:

def cost(theta):
 return nn(theta)[0]

def grad(theta):
 return nn(theta)[1]

Now I can call fmin_cg to minimize cost, given a set of initial theta values, a cost function and a gradient function:

model = scipy.optimize.fmin_cg(cost, x0=theta_init, fprime=grad, full_output=1)
theta = model[0] # Stores the optimized theta values for use in predict()
print "The cost is %f" %(model[1])
print "The accuracy is %f" %(accuracy_score(y, predict(theta,X)))

And we're done! This code, as it stands, will take a data set with two features through a three-layer neural network, wherein the hidden layer has three nodes, and return the weights that minimize the cost. Subsequently, feeding these weights to the predict() function returns the predicted values of y, and from this we can find the accuracy.

When I run this model, the cost is 0.132428 and the accuracy is 98%. Not bad.

VISUALIZING THE MODEL

Since the model is based on simple data (only two features) it’s actually pretty easy to visualize the outcome. I’ll go through that code now.

First, we define a function that generates everything we need to plot the decision boundary. This is heavily borrowed from Denny Britz’s post. We have two features, so it’s easy enough to plot the data assigning the features to the x and y axes respectively. The make_moons() function in sklearn.datasets (which generated our data set) makes these fun moon shaped data distributions:

plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)

[Figure: scatter plot of the make_moons() data set]

To plot the decision boundary over this, we'll take the minimum and maximum values of both features and construct a fine grid of fictitious points spanning that range. Feeding these into the predict() function gives a predicted outcome for every grid point, which we can draw as a filled contour plot to visualize the boundary:


def plot(pred_func):

 x_min, x_max = X[:,0].min() - 0.5, X[:,0].max() + 0.5
 y_min, y_max = X[:,1].min() - 0.5, X[:,1].max() + 0.5
 h = 0.01
 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

 Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
 Z = Z.reshape(xx.shape)

 plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
 plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.Spectral)

Now to actually draw the plot:


plot(lambda x: predict(theta,x))
plt.title("Decision Boundary for hidden layer size 3")

Here is the output when I run the full code:

[Figure: decision boundary for hidden layer size 3 overlaid on the data]

Adding more hidden layers will increase the complexity of the model, but of course you run the risk of overfitting:

[Figures: decision boundaries of increasing complexity, with accuracies 0.85, 0.985, 0.99 and 1.00]

Increasing the number of layers does improve accuracy, but it is very likely overfitting the data, meaning our model is too specific for the training set and won’t generalize well on other data. More is not always better.
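Reading "more hidden layers" here as a larger hidden layer (the network above has a single hidden layer whose size is set by the global Hidden variable), a rough sketch of the experiment behind those accuracies might look like this; it relies on nn() and predict() reading Hidden at call time:

# Retrain with different hidden layer sizes and compare training accuracy
for Hidden in [1, 2, 3, 4]:
    theta_init = np.concatenate((randtheta(Input, Hidden).reshape((Input + 1) * Hidden),
                                 randtheta(Hidden, Output).reshape((Hidden + 1) * Output)))
    model = scipy.optimize.fmin_cg(cost, x0=theta_init, fprime=grad, full_output=1)
    theta = model[0]
    print("Hidden layer size %d: accuracy %.3f" % (Hidden, accuracy_score(y, predict(theta, X))))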

A WORD ABOUT REGULARIZATION

For those of you paying close attention, you may have noticed one of the initial parameters was Lambda and that a regularization term was added to the cost and the gradient. I won't go into the specifics of regularization here, except to say that adjusting Lambda adjusts the penalty for complexity – regularization attempts to correct for overfitting. A very large Lambda value results in high bias: the model will underfit the data. Too small, and we get high variance: the model will overfit the data. In my testing, if the value of Lambda was too small (or 0), the model needed too many iterations for the cost to converge and I got an error. (This could be fixed by reducing the number of nodes in the hidden layer to 2.) Too large a value for Lambda and I either got a straight-line decision boundary, or the model didn't fit at all. For best results, an intermediate Lambda value should be set.

Ideally, we’d be dividing our data randomly into training, cross-validation and testing sets, and using the cross-validation set to optimize Lambda, always checking our work against the testing set. I clearly have not done this here, although it is on my to-do list.

IN CONCLUSION…

I was pretty pleased in the end with what this model achieves, although I haven't tested it with other data sets, so expect a follow-up post addressing that issue soon. I'm also quite new at this, so if you have suggestions or improvements, or if you notice glaring errors, please write in the comments or send me an email.