Today's mission is to run VGG-16 on some Kaggle data, and maybe a couple of other things.

Okay! Let's get to it.

Training, validation, and test data

Downloading thousands of images and sorting them into a file structure that VGG can understand, using only the command line, is turning out to be a surprisingly time-consuming task.

Yes, there's a bash command for that. It's starting to feel like there's a bash command for everything.

So how do you sort thousands of images into a file structure that VGG can understand?

You can do something like this:

- train
   - cats
   - dogs
- valid
   - cats
   - dogs
- test
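
If the unzipped Kaggle download looks anything like the usual dogs-vs-cats layout (labeled files named `cat.0.jpg`, `dog.0.jpg`, and so on, plus an unlabeled test folder), a few lines of bash are enough to build that tree. The `kaggle/` paths and filename patterns below are assumptions about the download, so treat this as a sketch rather than a recipe:

```bash
# Build the folder structure from the list above.
mkdir -p train/cats train/dogs valid/cats valid/dogs test

# Sort the labeled images into cats/ and dogs/ by filename prefix.
# (Assumes the download was unzipped into kaggle/train and kaggle/test,
# with files named cat.0.jpg, dog.0.jpg, and so on.)
mv kaggle/train/cat.*.jpg train/cats/
mv kaggle/train/dog.*.jpg train/dogs/

# The unlabeled test images just move across as-is.
mv kaggle/test/*.jpg test/
```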

What's up with all these folders? Well, to perform effective machine learning you typically divide your available data into three sets: training data, validation data, and test data.
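
That also explains why the valid folders need filling: a random slice of the training images gets moved across before training starts. Assuming GNU `shuf` is available, something like this would do it (the 1,000-image figure is an arbitrary choice for illustration, not a number from the dataset):

```bash
# Move a random sample of training images into the validation folders.
# shuf is GNU coreutils; on macOS it ships as gshuf via the coreutils package.
shuf -e train/cats/*.jpg | head -n 1000 | xargs -I{} mv {} valid/cats/
shuf -e train/dogs/*.jpg | head -n 1000 | xargs -I{} mv {} valid/dogs/
```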

Let's pretend for a minute that it's the height of the Clone Wars, we're the Kaminoans, and a messenger has just arrived from the Galactic Senate requesting clone assistance for an urgent mission. There is, however, a catch - due to the sensitive nature of the mission, only one clone can be chosen to go.

Yes, this is about Star Wars. If you have no idea what I'm talking about, all you need to know for the purposes of this example is that we have a bunch of clones and we need a way to pick the best one.

The training data, like its name suggests, forms the initial training regimen for our clones. We pass them pictures of cats and dogs, along with labels that inform them whether the animal in the picture is a cat or a dog, allowing the clones to form a mental model of what cats and dogs look like.

The validation data puts our clones to the test, and seeks to identify the clone best suited to the coming mission. Despite being, well, clones, our clones now have a variety of specializations and character traits due to planned and unplanned events in their training. As a result, we expect performance on identifying cats and dogs in the validation data to vary across our clones, allowing us to identify a single top performer.

That should be it, right? The clone gets on a shuttle to worlds unknown and our job is done. Not so fast. We have to consider the possibility that the clone we identified performs well only on the validation data.

To understand how this could happen, consider the possibility of our clone noticing a speeder in the background of many of the pictures labeled "cat". This leads our clone to conclude that "an animal is more likely to be a cat when a speeder is present alongside the animal".

This is, of course, wrong. Our clone has no way of knowing that the photographer hired by the Galactic Senate simply conducted the shoot for cats at a time when the traffic on Coruscant happened to be heavier than usual. Our clone is fixated, incorrectly, on the noise (the speeder) instead of the signal (the cat). Data scientists call this behavior overfitting. The risk of overfitting becomes especially high when only a small amount of data is available (if our photographer had taken thousands of photos across different days and locations, our clone would be less likely to make such an error).

In this case, if our clone were to encounter a dog next to a speeder during the mission, the dog would be incorrectly identified as a cat. Disaster.

To try and avoid this, we set aside a final collection of test data - inaccessible to the clones until the validation process is completed. Our top clone is tested on the data once and only once. After assessing our clone's performance, we must decide whether to proceed with the mission or restart the entire process with a new batch of clones.

It is important that we do not train our clone any further at this point. If we kept adjusting the training based on test results, we would risk overfitting to the test data in the same way we risked overfitting to the validation data.


Did I say at the beginning of this post that today's mission was to actually run VGG-16? It turns out that sorting files without a GUI and trying to describe how data sets are used for model selection is a lot of work on its own!

In fact, writing about machine learning in plain language is turning out to be a lot more work than I'd anticipated (not that I'd ever thought it would be easy), and I'm not convinced I even have enough context to be an effective explainer.

Does this mean that I should write less? Focus on getting a more complete picture before attempting to share bits of it? Maybe! Or maybe not! We're all works in progress.