CoLab TPUs

mardi 09 octobre 2018

The other day I was having problems with a CoLab notebook and I was trying to debug it when I noticed that TPU is now an option for runtime type. I found no references to this in the CoLab documentation, but apparently it was quietly introduced only recently. If anyone doesn't know, TPUs are chips designed by Google specifically for matrix multiplications and are supposedly incredibly fast. Last I checked the cost to rent one through GCP was about $6 per hour, so the ability to have access to one for free could be a huge benefit.

As TPUs are specialized chips you can't just run the same code as on a CPU or a GPU. TPUs do not support all TensorFlow operations and you need to create a special optimizer to be able to take advantage of the TPU at all. The model I was working with at the time was created using TensorFlow's Keras API so I decided to try to convert that to be TPU compatible in order to test it.

Normally you would have to use a cross shard optimizer, but there is a shortcut for Keras models:

TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']

# create network and compiler
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
keras_model, strategy = tf.contrib.tpu.TPUDistributionStrategy(
    tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

The first line finds an available TPU and gets it's address. The second line takes your keras model as input and converts it to a TPU compatible model. Then you would train the model using tpu_model.fit() instead of keras_model. This was the easy part.

For this particular model I am using a lot of custom functions for loss and metrics. Many of the functions turned out to not be compatible with TPUs so had to be rewritten. While at the time this was annoying, it turned out to be worth it regardless of the TPU because I had to optimize the functions in order to make them compatible with TPUs. The specific operations which were not compatible were non-matrix ops - logical operations and boolean masks specifically. Some of the code was downright hideous and this forced me to sit down and think through it and re-write it in a much cleaner manner, vectorizing as much as possible.

After all that effort, so far my experience with the TPUs hasn't been all that great. I can train my model with a significantly larger batch size - whereas  on an Nvidia K80 16 was the maximum batch size, I am currently training with batches of 64 on the TPU and may be able to push that even higher. However the time per epoch hasn't really improved all that much - it is about 1750 seconds on the TPU versus 1850 seconds on the K80. I have read code may need to be altered more to take full advantage of TPUs and I have not really tried playing with the batch size to see how that changes the performance yet.

I suspect that if I did some more research about TPUs and coded the model to be optimized for a TPU from scratch there might be a more noticeable performance gain, but this is based solely on having heard other people talk about how fast they are and not from my experience. 

Update - I have realized that the data augmentation is the bottleneck which is limiting the speed of training. I am training with a Keras generator which performs the augmentation on the CPU and if this is removed or reduced the TPUs do, in fact, train significantly faster than a GPU and also yield better results.

Libellés: coding, machine_learning, google_cloud
Aucun commentaire

I have previously written about Google CoLab which is a way to access Nvidia K80 GPUs for free, but only for 12 hours at a time. After a few months of using Google Cloud instances with GPUs I have run up a substantial bill and have reverted to using CoLab whenever possible. The main problem with CoLab is that the instance is terminated after 12 hours taking all files with it, so in order to use them you need to save your files somewhere.

Until recently I had been saving my files to Google Drive with this method, but while it is easy to save files to Drive it is much more difficult to read them back. As far as I can tell, in order to do this with the API you need to get the file id from Drive and even then it is not so straightforward to upload the files to CoLab. To deal with this I had been uploading files that needed to be accessed often to an AWS S3 bucket and then downloading them to CoLab with wget, which works fine, but there is a much simpler way to do the same thing by using Google Cloud Storage instead of S3.

First you need to authenticate CoLab to your Google account with:

from google.colab import auth

auth.authenticate_user()

Once this is done you need to set your project and bucket name and then update the gcloud config.
project_id = [project_name]
bucket_name = [bucket_name]
!gcloud config set project {project_id}

After this has been done files can simply and quickly be upload or downloaded from the bucket with the following simple commands:

# download
!gsutil cp gs://{bucket_name}/foo.bar ./foo.bar

# upload
!gsutil cp  ./foo.bar gs://{bucket_name}/foo.bar

I actually have been adding the line to upload the weights to GCS to my training code so it is automatically uploaded every couple epochs, which removes the need for me to manually back them up periodically throughout the day.

Libellés: coding, python, machine_learning, google, google_cloud
1 commentaires

Keras

mercredi 26 septembre 2018

When I first started working with TensorFlow I didn't really like Keras. It seemed like a dumbed down interface to TensorFlow and I preferred having greater control over everything to the ease of use of Keras. However I have recently changed my mind. When you use Keras with a TensorFlow back-end you can still use TensorFlow if you need to tweak something that you can't in Keras, but otherwise Keras just provides an easier to use way to access TensorFlow's functionality. This is especially useful for prototyping models since you can easily make changes without having to write or rewrite a lot of code. I used to write my own functions to do things like make a convolutional layer, but most of that was duplicating functionality that already exists in Keras. 

My original opinion was incorrect, Keras is a valuable tool for creating neural networks, and since you can mix TensorFlow in, there is nothing lost by using it.

Libellés: machine_learning, tensorflow, keras
1 commentaires

IOU Loss

mercredi 29 août 2018

When doing binary image segmentation, segmenting images into foreground and background, cross entropy is far from ideal as a loss function. As these datasets tend to be highly unbalanced, with far more background pixels than foreground, the model will usually score best by predicting everything as background. I have confronted this issue during my work with mammography and my solution was to use a weighted sigmoid cross entropy loss function giving the foreground pixels higher weight than the background.

While this worked it was far from ideal, for one thing it introduced another hyperparameters - the weight - and altering the weight had a large impact on the model. Higher weights favored predicting pixels as positive, increasing recall and decreasing precision, and lowering the weight had the opposite effect. When training my models I usually began with a high weight to encourage the model to make positive predictions and gradually decayed the weight to encourage it to make negative predictions.

For these types of segmentation tasks Intersection over Union tends to be the most relevant metric as pixel level accuracy, precision and recall do not account for the overlap between predictions and ground truth. Especially for this task, where overlap can be the difference between life and death for the patient, accuracy is not as relevant as IOU. So why not use IOU as a loss function?

The reason was because IOU was not differentiable so can not be used for gradient descent. However Wang et al have written a paper - Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation - which provides an easy way to use IOU as a loss function. In addition, this site provides code to implement this loss function in TensorFlow.

The essence of this method is that rather than using the binary predictions to calculate IOU we use the sigmoid probability output by the logits to estimate it which allows IOU to provide gradients. At first I was skeptical of this method, mostly because I understood cross entropy better and it is more common, but after I hit a performance wall with my mammography models I decided to give it a try.

My models using cross-entropy loss had ceased to improve validation performance so I switched the loss function and trained them for a few more epochs. The validation metrics began to improve, so I decided to train a copy of the model from scratch with the IOU loss. This has been a resounding success. The IOU loss accounts for the imbalanced data, eliminating the need to weight the cross entropy. With the cross entropy loss the models usually began with recall of near 1 and precision of near 0 and then the precision would increase while the recall slowly decreased until it plateaued. With IOU loss they both start near 0 and gradually increase, which to me seems more natural. 

Training with an IOU loss has two concrete benefits for this task - it has allowed the model to detect more subtle abnormalities which models trained with cross entropy loss did not detect; and it has reduced the number of false positives significantly. As the false positives are on a pixel level this effectively means that the predictions are less noisy and the shapes are more accurate.

The biggest benefit is that we are directly optimizing for our target metric rather than attempting to use an imperfect substitute which we hope will approximate the target metric. Note that this method only works for binary segmentation at the moment. It also is a bit slower than using cross entropy, but if you are doing binary segmentation the performance boost is well worth it.

 

Libellés: python, machine_learning, mammography, convnets
Aucun commentaire

Early Stopping

lundi 30 juillet 2018

I recently began using the early stopping feature of Light GBM, which allows you to stop training when the validation score doesn't improve for a certain number of rounds. This is especially useful if you are bagging models, as you don't need to watch each one and figure out when training should stop. The way it works is you specify a number of rounds, and if the validation score doesn't improve during that number of rounds the training is stopped and the round with the best validation score is used.

When working with this I noticed that often the best validation round is a very early round, which has a very good validation score but an incredibly low training score. As an example here is the output from a model I am currently training. Normally the training F1 gets up to the high 0.90s:

Early stopping, best iteration is:
[7]	train's macroF1: 0.525992	valid's macroF1: 0.390373

Out of at least 400 rounds of training, the best performance on the validation set was on the 7th, at which time it was performing incredibly poorly on the training data. This indicates overfitting to the validation set, which is just as bad as overfitting to the training set in that the model is not likely to generalize well.

So what to do about this issue? The obvious solution would be to provide a minimum number of rounds and begin to monitor the validation score for early stopping once that number of rounds has passed, but I don't see any way to do this through the LGB API. 

I am running this code using sklearn's joblib to do parallel processing, so I have create a list of the estimators to fit and then pass that list to the parallel processing which calls a function which fits the estimator to the data and returns it. The early stopping is taken care of by LGB, so what I did is after the estimator is fit I manually get the validation results and the train performance for the best validation round. If the train performance is above a specified threshold I return the estimator as normal. If, however, the train performance is below that threshold I recursively call the function again. 

The downside to this is that it is possible to get into an infinite loop, but if the thresholds are properly tuned this should be easily avoidable. 

 

Libellés: coding, data_science, machine_learning, lightgbm
1 commentaires

Archives du Blogue