I haven’t literally been back to Italy. I’m just revisiting the Identify Italy project.
Quick recap: According to a reputable source, Italy has 28% more elevators than the United States (and is “the world’s largest market for elevators”, despite Spain having more elevators than Italy).
When I read it, this statistic struck me as unusual. Italy has 1/30th the land area and 1/5th the population of the United States (it also has its own census agency, Istat, whose website banner crushes the US Census Bureau's).
Side note: Istat’s website provides random factoids about Italy’s demographics on the front page (22 million Italians read online books, ebooks, or newspapers, for instance). The Census Bureau doesn’t provide any random factoids. Come on, guys, step up your game!
Regardless, my hypothesis was that this elevator crisis is caused by Italy lacking much of a suburban area. So, I downloaded satellite imagery of Italy from Google Maps and set to work identifying swaths of it as one of 7 categories: Urban, Suburban, Industrial, Farm, Rural, Water, and Unknown. “Unknown” was largely for images that weren’t stored for some reason and so appeared grey, or that were mostly cloud. I set up a little web UI where I can quickly identify a bunch.
Now I have 1.7 million image squares which cover Italy and the surrounding seas (Adriatic and Tyrrhenian, I was awake in Geography) and 897 hand-labeled data points identifying different images. The big question is how to train a computer to identify the different types, and it remained unanswered for a long time.
A few weeks ago, I decided to apply Apache Spark to this project. Spark provides a simple, powerful way to express data analytics algorithms, along with a platform that scales well horizontally and executes whatever you express as lazily evaluated, in-memory operations. It’s really quite fast.
For this project, lacking any and all skills in machine learning, I figured I’d use Spark’s MLlib. They provide an implementation of a random forest, which I used.
After much tinkering with the random forests, I eventually decided on a two-stage tree classifier. The first stage deals only with single colors, and tries to predict the terrain type from the color of a single pixel. That is, for a given image that’s been classified by hand, it assigns that classification to each pixel in the image - for instance, an image marked “Farm” may have a red pixel (255,0,0) and a blue pixel (0,0,255), so both of these colors will be associated with “Farm”. This is done for each image we have training data for (actually, 70% of the images, the other 30% being held out for testing), and this stream of pixels is fed into the random forest training algorithm. After building a confusion matrix on the test data, this is the result:
Precision: 0.6874467926180897

Confusion matrix (rows are actual, columns are predicted, both in the order rural, urban, suburban, farm, unknown, water, industrial):

| actual \ predicted | rural | urban | suburban | farm | unknown | water | industrial |
|---|---|---|---|---|---|---|---|
| rural | 983243 | 266 | 11 | 316805 | 10358 | 11868 | 0 |
| urban | 47638 | 779 | 0 | 38079 | 9404 | 5182 | 0 |
| suburban | 80885 | 114 | 40 | 72375 | 7407 | 7811 | 0 |
| farm | 382910 | 196 | 2 | 497653 | 3748 | 7255 | 0 |
| unknown | 69776 | 297 | 0 | 7852 | 770870 | 36286 | 0 |
| water | 44301 | 118 | 17 | 11204 | 9154 | 434804 | 0 |
| industrial | 17033 | 172 | 0 | 22698 | 422 | 199 | 0 |

Obviously, numbers suck, so I made a beautiful chart:
1 is rural, 2 is urban, and so forth. Each column has been normalized. Observe the strong yellow band that runs from the upper-left corner to the lower-right corner, indicating that 897 samples make a pretty decent training set.
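For reference, the per-column normalization behind the chart can be sketched in plain Python (a hypothetical standalone version; the actual chart was made in MATLAB):

```python
def normalize_columns(matrix):
    """Divide each entry of a square matrix by its column's sum,
    leaving all-zero columns as zeros (avoids division by zero)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [
        [matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]
```

After this, each column's entries sum to 1 (except the all-zero "industrial" column), so the brightest cell in a column shows where most predictions of that class actually came from.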
Also note the seeming confusion between categories 1 and 4 (rural and farm). Frankly, the distinction is hard for a human, so I can’t blame the computer.
The astute will note, however, that there are only 6 columns, but 7 rows. This indicates two things:
- The classifier never classified the pixel as “industrial” (the last category, 7), and
- I’m new to MATLAB, and when I put in a column of all zeroes it thought I didn’t want a column at all.
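Concretely, the stage-1 training-set construction described above - every pixel inherits its image's hand-assigned label - might look like this (a pure-Python sketch with made-up names; the real job does the equivalent as a Spark transformation):

```python
def pixels_to_training_rows(images):
    """Expand hand-labeled images into per-pixel training rows.

    `images` is a list of (label, pixels) pairs, where `pixels` is a
    list of (r, g, b) tuples.  Every pixel inherits its image's label,
    so an image marked "Farm" contributes one "Farm" row per pixel.
    """
    rows = []
    for label, pixels in images:
        for (r, g, b) in pixels:
            rows.append(((r, g, b), label))
    return rows
```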
The trivial approach would be to classify an image by a majority vote of its pixels. However, I decided to use another random forest model. This model’s inputs are the (normalized) vote counts from the pixel classifier. For instance, if 2 pixels were classified as “urban”, 5 as “suburban”, and so on, then the input to the 2nd-stage classifier would be (2, 5, ...).
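That stage-2 feature construction can be sketched as follows (hypothetical names again; the real version runs inside the Spark job):

```python
CATEGORIES = ["rural", "urban", "suburban", "farm", "unknown", "water", "industrial"]

def vote_features(pixel_predictions):
    """Turn a list of per-pixel class predictions for one image into a
    normalized vote-count vector, ordered by CATEGORIES."""
    counts = [pixel_predictions.count(c) for c in CATEGORIES]
    total = len(pixel_predictions)
    return [n / total for n in counts] if total else counts
```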
Let’s look at the result:
Wow, that sucks. Astute or not, it clearly can only correctly identify 1, 5, and 6 (rural, unknown, and water).
However, what this confusion matrix doesn’t show is that columns 2 and 3 (urban and suburban) each only had a test data size of 1. In other words, we only predicted urban and suburban a single time each out of the test set, which is why it’s so yellow. Only 11 images in the test set were actually either urban or suburban (and there were only about 100 images in the test set overall), so the jury’s still out on how accurate this classifier is.
One last cool thing.
MLlib’s random forest implementation has the property that if you have a trained model, you can access the trees individually. Once the whole-image classifier is trained, I then run the whole-image classifier on every image (all 1.7 million). Because I can see the predictions of the individual trees, I can see which images have the least consensus. That is, the Spark job can determine which images it’s least sure about.
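One way to express “least consensus” is to score each image by the share of trees agreeing with the plurality vote and sort ascending. A sketch (this is not MLlib's API; `tree_votes` is a hypothetical map from image id to each tree's predicted class):

```python
from collections import Counter

def least_certain(tree_votes, k):
    """Return the k image ids whose trees agree least, i.e. whose
    plurality vote holds the smallest fraction of the trees."""
    def consensus(votes):
        top_count = Counter(votes).most_common(1)[0][1]
        return top_count / len(votes)
    return sorted(tree_votes, key=lambda img: consensus(tree_votes[img]))[:k]
```

An image where the trees split 5-5 between two classes scores 0.5 and sorts ahead of one where all trees agree (score 1.0).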
The Spark job then prints out these images, and I copy this list into a MySQL database, which feeds that list into the webpage you can use to classify the images.
So when you go to http://pillow.rscheme.org/italy/, 50% of the time it shows you one of the top 500 images that it’s least sure about (the other 50% of the time it’s random).
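The serving policy amounts to a coin flip (a toy sketch with hypothetical names, not the actual code behind the page):

```python
import random

def pick_image(least_sure, all_images, rng=random):
    """Half the time, serve one of the top-500 least-sure images;
    otherwise serve a uniformly random image."""
    if rng.random() < 0.5:
        return rng.choice(least_sure[:500])
    return rng.choice(all_images)
```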
So go, identify images. Your efforts are now optimized.