I downloaded the Australian Sign Language dataset from the UCI Knowledge Discovery in Databases Archive. The data consists of a sample of Australian Sign Language signs performed by volunteers. There are 95 unique signs, each recorded 27 times on different days. The data was recorded using two Fifth Dimension Technologies (5DT) gloves (one for each hand) and two Ascension Flock-of-Birds magnetic position trackers. Together, this produced 22 channels of data, 11 for each hand. These channels included x, y and z position, roll, pitch and yaw movements and finger bend measurements for each finger.

The data was supplied as 2565 text files, one for each recording of a sign. Each file contains time series data across the 22 channels. The data was recorded at approximately 100 Hz, and the average number of frames per file is 57 (with a range from 45 to 136).

To start, I loaded the data into an IPython notebook, using the os.walk() function to iterate through the directory:

    import numpy as np
    import os
    import re

Next, I iterated through the directory structure and imported each file. The naming format was “Sign-#.tsd”, for example “Alive-1.tsd”. Using the re package, I could do a simple regex search to pull out the sign name (the y values of the data set), which is stored in a numpy array.
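To make the label extraction concrete, here is a minimal sketch that applies the same pattern to a few file names. Only “Alive-1.tsd” comes from the text above; the other names are made up for illustration.

```python
import re

# "Alive-1.tsd" is the example from the text; the other names are hypothetical.
filenames = ["Alive-1.tsd", "Alive-2.tsd", "thank-you-3.tsd"]

# Lazily capture everything before the first "-<digit>" in the file name.
labels = [re.search(r'(.+?)-[0-9]', fn).group(1) for fn in filenames]
print(labels)
```

Because the group is lazy and must be followed by a hyphen and a digit, a hyphen inside the sign name itself does not cut the label short.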

Because different signs take varying amounts of time to perform, each file was a different length. Since most models require that all the data is the same length, I initialized an array of NaN values as long as the longest flattened recording (2992 values) and used np.put() to overwrite the beginning of the array with the current file’s data.

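    x = np.empty(0)
    y = np.empty(0)
    data = np.empty(0)

    for root, dirs, files in os.walk('tctodd'):
        for i, fn in enumerate(files):
            if fn.endswith("tsd"):
                # NaN-filled row as long as the longest recording: (2992,)
                current_file = np.full(2992, np.nan)
                # Flatten column by column, so each channel stays contiguous
                vals = np.loadtxt(os.path.join(root, fn), delimiter='\t').ravel(order='F')
                np.put(current_file, range(0, len(vals)), vals)
                data = np.append(data, current_file)
                # The sign name is everything before "-<number>"
                y = np.append(y, re.search('(.+?)-[0-9]', fn).group(1))

    # Integer division: reshape() needs integer dimensions in Python 3
    x = data.reshape((len(data) // 2992, 2992))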
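The layout this produces is easier to see on a toy example. The sketch below uses a made-up recording of 3 frames and 2 channels (the real files have 22 channels) and pads it to a hypothetical fixed length of 8 values.

```python
import numpy as np

# Hypothetical toy recording: 3 frames x 2 channels.
recording = np.array([[1.0, 10.0],
                      [2.0, 20.0],
                      [3.0, 30.0]])

# Column-major flattening keeps each channel's samples next to each other.
vals = recording.ravel(order='F')

# Pad to a fixed length (here 4 frames x 2 channels = 8 values) with NaN.
padded = np.full(8, np.nan)
np.put(padded, range(len(vals)), vals)
print(padded)
```

Note that the NaN padding sits at the end of the whole flattened row rather than at the end of each channel, so in a row with n real frames, channel i occupies positions i*n to (i+1)*n.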

I saved the data using numpy’s np.save() function:

    np.save('x', x)
    np.save('y', y)

Each row has the length 2992, which corresponds to the length of the longest recording (136) multiplied by the number of variables (22). This is a good start, but a lot of the data is missing. To fill in the gaps, I spread each recording’s real values across random positions in a full-length array and used numpy’s linear interpolator to fill in the values in between:

    import random

    def npinterpolate(data):

        def nanfinder(x):
            # Boolean NaN mask, plus a helper that turns a mask into indices
            return np.isnan(x), lambda z: z.nonzero()[0]

        x_interp = []
        for row in range(data.shape[0]):
            holder = []
            # Number of real frames in this recording
            # (int() replaces np.int, which newer numpy versions no longer provide)
            dim = int(np.count_nonzero(~np.isnan(data[row])) / 22)
            for i in range(1, 23):
                scaffold = np.array([np.nan] * 136)
                current_var = data[row, i*dim-dim:i*dim]
                # Drop the real values at sorted random positions
                randpts = np.sort(random.sample(range(136), dim))
                scaffold[randpts] = current_var[:]
                # Linearly interpolate everything that is still NaN
                nans, x = nanfinder(scaffold)
                scaffold[nans] = np.interp(x(nans), x(~nans), scaffold[~nans])
                holder.extend(scaffold.tolist())
            x_interp.extend(holder)
        return np.array(x_interp).reshape(data.shape)

    x_interp = npinterpolate(x)
    np.save('x_interp', x_interp)

This code generates a scaffold of NaN values for each channel, randomly chooses as many scaffold indices as the recording has frames, fills the real values into the scaffold at those indices (in their original order), and then interpolates the missing values. Finally, the interpolated data is saved using np.save().
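The same idea can be sketched on a single short channel. The signal values and the target length of 10 are made up for illustration (the real scaffold has 136 positions), and the seed is fixed only so the sketch is repeatable.

```python
import random
import numpy as np

random.seed(0)  # fixed seed only so the sketch is repeatable

target_len = 10                           # stands in for 136
signal = np.array([0.0, 1.0, 4.0, 9.0])   # hypothetical short channel

# Place the real samples at sorted random positions in a NaN scaffold.
scaffold = np.full(target_len, np.nan)
positions = np.sort(random.sample(range(target_len), len(signal)))
scaffold[positions] = signal

# Linearly interpolate the remaining NaNs from the known samples.
nans = np.isnan(scaffold)
idx = np.arange(target_len)
scaffold[nans] = np.interp(idx[nans], idx[~nans], scaffold[~nans])
print(scaffold)
```

Positions before the first real sample or after the last one are filled with that edge value, since np.interp holds the end points constant rather than extrapolating.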

Now I have a single array of dimensions 2565 x 2992, where each row represents one recording of a sign, and I’m ready to begin training my model. But before any modelling can take place, I split my data into training and testing sets using train_test_split():

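    # train_test_split lives in sklearn.model_selection in current scikit-learn
    # (the old sklearn.cross_validation module has been removed)
    from sklearn.model_selection import train_test_split

    # Use the interpolated data, since the models cannot handle NaN values
    x_train, x_test, y_train, y_test = train_test_split(x_interp, y)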

Next, I instantiate a scikit-learn pipeline. Pipelines are a way of combining multiple steps into a single workflow, so that the entire process can be changed or optimized efficiently.

    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVC

    pipeline = make_pipeline(StandardScaler(), PCA(), LinearSVC())

The pipeline specifies three steps: scaling using StandardScaler(), principal components analysis (PCA) using the PCA() function and finally a linear support vector classifier (LinearSVC()).

There are almost 3000 features for each sign in my data set, which might be overkill for model training, so to reduce the dimensionality of the data I used principal components analysis (PCA). Before PCA, the data is standardized (centered on the mean and scaled to unit variance), which is done with the StandardScaler class in scikit-learn. Once the PCA algorithm has reduced the data to just those components which capture the most variance, the result is used to train a linear support vector classification (SVC) model.
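As a rough illustration of what “capturing the most variance” means, the sketch below runs the same scaling and PCA steps on synthetic data. The data is made up: 50 features that are driven by only 3 underlying factors plus a little noise, so it is not a stand-in for the sign language recordings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: 200 samples, 50 features driven by only 3 latent factors.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10).fit(X_scaled)

# Cumulative share of the variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))
```

Here almost all of the variance should land in the first three components, which is the situation where dropping the remaining dimensions costs very little information.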

I’m also going to use GridSearchCV(), which searches through a grid of parameter values to find the combination that performs best on the training data. GridSearchCV also performs cross validation (in this case 3-fold cross validation, which was the default when I ran this; newer scikit-learn releases default to 5-fold). The param_grid dictionary specifies which parameters will be searched. The naming convention is the pipeline step name (which make_pipeline sets to the lower-cased class name) followed by two underscores and then the parameter name. As you can see below, we’ll test 10 n_components values from 50 to 95 in PCA and 10 C values between 0.01 and 0.1 in the Linear SVC.

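    # GridSearchCV lives in sklearn.model_selection in current scikit-learn
    # (the old sklearn.grid_search module has been removed)
    from sklearn.model_selection import GridSearchCV

    param_grid = {'pca__n_components': np.arange(50, 100, 5),
                  'linearsvc__C': 10 ** np.linspace(-2, -1, 10)}

    # cv=3 makes the 3-fold cross validation explicit
    gridsearch = GridSearchCV(pipeline, param_grid=param_grid, cv=3, verbose=3)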
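If you are unsure which keys are valid in param_grid, you can ask the pipeline itself. This is a small sketch that rebuilds the same three-step pipeline and lists its step names and tunable parameters.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipeline = make_pipeline(StandardScaler(), PCA(), LinearSVC())

# make_pipeline names each step after its lower-cased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Valid param_grid keys have the form "<step name>__<parameter name>".
valid_keys = pipeline.get_params().keys()
print('pca__n_components' in valid_keys, 'linearsvc__C' in valid_keys)
```

A misspelled key makes the grid search raise an error when fitting, so checking against get_params() first can save a failed run.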

Now that everything is set up, we can run the model by calling the fit method on the gridsearch object. Sit back, since this code will fit 300 models and automatically determine the optimal number of components and the optimal C value – it could take a while.

gridsearch.fit(x_train, y_train)
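The figure of 300 fits can be checked with a short sketch, assuming the same parameter grid and 3-fold cross validation described above. ParameterGrid is the scikit-learn helper that enumerates every combination in a grid.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'pca__n_components': np.arange(50, 100, 5),
              'linearsvc__C': 10 ** np.linspace(-2, -1, 10)}

n_candidates = len(ParameterGrid(param_grid))  # 10 x 10 combinations
n_folds = 3                                    # 3-fold cross validation
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV then fits the best combination once more on the whole training set, so the true total is one fit higher.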

When the model is done training, you can print the scores the model obtains on the training and testing sets quite easily by calling the score method on the gridsearch object. (Note: this code is written for Python 3, where print is a function. Change accordingly for Python 2).

    # round() works whether score() returns a Python float or a numpy float
    print('The score on the training set is:', round(gridsearch.score(x_train, y_train), 3))
    print('The score on the testing set is:', round(gridsearch.score(x_test, y_test), 3))

When I ran this code, I got a training score of 0.994 and a testing score of 0.875. This suggests that my model is overfitting the training data a bit. I might do better if I searched a wider range of values for C, which is the penalty parameter of the error term. Additionally, from what I read here, SVCs are quite good at handling overfitting even when the number of features is greater than the number of observations. In this case it may be wiser to skip the PCA step (more on this later!).

We can also print the best parameters as determined by the grid search:

    # Loop over the best parameters instead of relying on dictionary order
    for name, value in gridsearch.best_params_.items():
        print('The', name, 'parameter value is:', np.round(value, 4))

The final piece is to plot the n_components and C values to visualize how they affect the score:

    import matplotlib.pyplot as plt
    %matplotlib notebook

    pca_dim = len(param_grid['pca__n_components'])
    c_dim = len(param_grid['linearsvc__C'])

    # cv_results_ is available when GridSearchCV comes from sklearn.model_selection
    # (it replaces the older grid_scores_ attribute). Candidates are ordered with
    # C varying slowest and n_components varying fastest.
    scores = gridsearch.cv_results_['mean_test_score'].reshape(c_dim, pca_dim)

    plt.matshow(scores, cmap='Spectral')
    plt.xlabel('PCA: Dimensionality')
    plt.ylabel('Linear SVM: C parameter')
    plt.colorbar()
    plt.xticks(np.arange(pca_dim), param_grid['pca__n_components'])
    plt.yticks(np.arange(c_dim), param_grid['linearsvc__C'].round(4))
    plt.suptitle('PCA Dimensionality vs SVC C Parameter in Terms of Model Score', fontsize=16)

As you can see, the score ranges from bad to good across the various hyperparameters. However, I’m still not satisfied with the performance on the testing data, given that the model appears to have overfit the training data. As I mentioned above, let’s try leaving out the PCA step and instead performing a grid search over a wider range of C parameters using just the Linear SVC.

The only code that changes is the parameter grid and the instantiation of GridSearchCV():

    # Note: here we don't need the step-name prefix, since we're only
    # optimizing the parameters of a single estimator
    param_grid = {'C': 10 ** np.linspace(-3, 3, 100)}
    gridsearch = GridSearchCV(LinearSVC(), param_grid, verbose=3)
    gridsearch.fit(x_train, y_train)

With these settings, I again get a high score on the training data (0.999) whilst the test set scores a slightly better 0.897. The plot of the model score vs the C parameter (and the code to make it) is below. The model does best with a C parameter around 0.1. Increasing C from there scores worse, although the score then stays roughly constant all the way up to 1000. It appears the overfitting has not been solved, though.

    # cv_results_ is available when GridSearchCV comes from sklearn.model_selection
    # (it replaces the older grid_scores_ attribute)
    scores = gridsearch.cv_results_['mean_test_score']
    values = np.log10(param_grid['C'])

    plt.plot(values, scores)
    plt.xlabel('log10(C)')
    plt.ylabel('Score')
    plt.suptitle('Score vs C parameter: LinearSVC()', fontsize=16)

As a final stab before moving on, I’m going to try the regular SVC() class, which differs mainly in that it lets you choose a kernel function. The options are linear, polynomial, sigmoid or rbf. I stuck with the default, rbf, and used GridSearchCV to search for the optimal combination of the two parameters, C and gamma:

    from sklearn.svm import SVC

    param_grid = {'C': 10. ** np.linspace(0, 4, 15),
                  'gamma': 10 ** np.linspace(-5, -1.3, 15)}
    gridsearch = GridSearchCV(SVC(), param_grid, verbose=3, cv=3)
    gridsearch.fit(x_train, y_train)

Here, the training set score is 0.99 and the testing set score has improved a little to 0.91. The optimal value for gamma was ~2.73e-4 and for C was ~373. Clearly, there is some value in this approach. The plot of C vs gamma is below:

It’s apparent from the plot that there is a trade-off between the gamma and C values. I’m not sure if there’s improvement to be had without altering the pre-processing steps.
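One way to build some intuition for gamma is to look at the rbf kernel directly. The sketch below is only an illustration on two made-up points, using scikit-learn’s rbf_kernel helper, and is not part of the model above.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points at Euclidean distance 5 from each other.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0]])

# rbf_kernel computes exp(-gamma * ||a - b||^2) for each pair of rows.
gammas = [1e-4, 1e-2, 1e-1]
similarities = [rbf_kernel(points, gamma=g)[0, 1] for g in gammas]

for g, s in zip(gammas, similarities):
    print(g, s)
```

As gamma grows, the similarity between the same two points falls, so each training sample influences a smaller neighbourhood and the decision boundary can become more intricate. C also controls how closely the model is allowed to follow the training data, which is my rough understanding of why the two parameters trade off against each other.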

All things said and done, however, I’m happy with a score of 0.91 on the testing data and I think the model performed pretty well. Could things improve? I’m sure. In my research, I stumbled across Triangular Global Alignment kernels, and I think the technique might be a great fit for this kind of classification. It’ll take a bit of work though, because that implementation doesn’t exist in a library yet, although it can be manually added to $PYTHONPATH. scikit-learn’s SVC class allows for custom kernels by passing a callable as the kernel parameter (or the kernel matrix can be pre-computed and supplied with kernel='precomputed' too!)
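Both ways of supplying a custom kernel can be sketched as follows. The toy_kernel function is a hypothetical stand-in (a plain dot product, not a Triangular Global Alignment kernel), and the two-cluster data is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

def toy_kernel(A, B):
    """Hypothetical stand-in kernel: a plain dot product (linear kernel)."""
    return A @ B.T

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, size=(20, 5)),
               rng.normal(2, 1, size=(20, 5))])
labels = np.array([0] * 20 + [1] * 20)

# Option 1: pass the kernel function directly.
clf_callable = SVC(kernel=toy_kernel).fit(X, labels)

# Option 2: pre-compute the kernel matrix and pass kernel='precomputed'.
clf_precomputed = SVC(kernel='precomputed').fit(toy_kernel(X, X), labels)

# When predicting, the precomputed variant needs K(new samples, training samples).
pred_callable = clf_callable.predict(X)
pred_precomputed = clf_precomputed.predict(toy_kernel(X, X))
print((pred_callable == pred_precomputed).all())
```

The precomputed route is probably the more practical one for an expensive time series kernel, since the kernel matrix can be computed once and reused across a grid search over C.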

Finally, I should acknowledge Mohammed Waleed Kadous, who donated the data to the UCI KDD Archive.