I’m here with some fascinating news, guys. Philip K. Dick may have been joking with the title of his famous novel Do Androids Dream of Electric Sheep?, but science has recently answered this deep philosophical question for us. In the affirmative. The fabulous Janelle Shane trains neural networks on image recognition datasets with the goal of uncovering some incidental humour. She’s taken this opportunity to answer a long-standing question in AI. As it turns out, artificial neural networks do indeed dream of digital sheep. Whether androids will too is a bit more difficult to say. I’d hope we would improve our AI software a bit more before we start trying to create artificial humans.
As Shane explains in the above blog post, the neural network was trained on thousands or even millions (or more) of images, which were pre-tagged by humans for important features: in this case, lush green fields and rocky mountains, and also sheep and goats. After training, she tested it on images with and without sheep, and it turns out it’s surprisingly easy to confuse. It assumed sheep where there were none and missed sheep (and goats) staring it right in the face. In the latter case, it identified them as various other animals based on the other tags attached to images of them: dogs in your arms, birds in a tree, cats in the kitchen.
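To make the setup concrete, here’s a deliberately tiny sketch of that kind of whole-image tag predictor. This is not Shane’s actual system; the colours, images, and nearest-neighbour method are all invented for illustration. The point is that the training images carry tags for the whole image, with nothing saying which pixels are the sheep:

```python
# Toy sketch (not Shane's actual setup): a whole-image tag predictor.
# Each training image is reduced to a crude colour histogram and carries
# human-assigned tags for the WHOLE image -- nothing marks which pixels
# are the sheep.

def colour_histogram(pixels):
    """Bucket (r, g, b) pixels into 8 coarse bins by thresholding each channel."""
    hist = [0] * 8
    for r, g, b in pixels:
        hist[(r > 127) * 4 + (g > 127) * 2 + (b > 127)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def predict_tags(train, pixels):
    """Return the tags of the nearest training image (1-nearest-neighbour)."""
    hist = colour_histogram(pixels)
    _, tags = min(train, key=lambda item: distance(item[0], hist))
    return tags

# Invented training data: green fields tagged "sheep", grey rocks tagged "goat".
GREEN, GREY, WHITE = (40, 200, 40), (100, 100, 100), (250, 250, 250)
train = [
    (colour_histogram([GREEN] * 95 + [WHITE] * 5), {"field", "sheep"}),
    (colour_histogram([GREY] * 95 + [WHITE] * 5), {"rocks", "goat"}),
]

# An empty green field -- no sheep at all -- still comes back tagged "sheep".
print("sheep" in predict_tags(train, [GREEN] * 100))  # True
```

The predictor never isolates a sheep; it just finds the training image whose overall colours match best, which is exactly the kind of shortcut that produces sheep where there are none.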
This is where Shane and I come to a disagreement. She suggests that the confusion is the result of insufficient context clues in the images. That is, fur-like texture next to a tree makes a bird; with a leash, it makes a dog; in a field, a sheep. The network sees a field and expects sheep. If there’s an over-abundance of sheep in the fields in the training data, it starts to expect sheep in all the fields.
But I wonder: what about the issue of a paucity of tags? Because of the way images are tagged, there’s not a lot of hint about what the tags are referring to. Unlike more standard teaching examples, these images are very complex, and there are lots of things in them besides what the tags note. I think the flaw is a lot deeper than Shane posits. The AI doesn’t know how to recognize discrete objects like a human can. Once you teach a human what a sheep is, they can recognize it in pretty much any context, even a weird one like a spaceship or a fridge magnet. But a neural net isn’t sophisticated enough, or, most generously, structured properly, to understand what the word “sheep” is actually referring to. It’s quite possible the method of tagging is directly interfering with the ANN’s ability to understand what it’s intended to do.
The images are going to contain so much information, so many possible objects that each tag could refer to, that it might be matching “sheep”, say, to something entirely different from what a human would match it to. “Fields” or “lush green” are easy to do: if there are a lot of green pixels, those tags are pretty likely, and because the green takes up a large portion of the information in the image, there’s less chance of false positives.
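A “lush green” detector of the kind I mean really is this simple. The thresholds and pixel values below are invented, but the idea is just counting green-dominant pixels:

```python
# A crude "lush green field" detector: count the fraction of pixels where
# green clearly dominates. Thresholds and test images are invented.

def green_fraction(pixels):
    """Fraction of pixels where green clearly dominates red and blue."""
    greenish = sum(1 for r, g, b in pixels if g > r + 30 and g > b + 30)
    return greenish / len(pixels)

field = [(40, 190, 50)] * 80 + [(250, 250, 250)] * 20   # grass plus white blobs
kitchen = [(200, 180, 160)] * 100                        # beige tiles

print(green_fraction(field) > 0.5)    # True  -> tag "field" is likely
print(green_fraction(kitchen) > 0.5)  # False
```

Because the grass covers most of the image, this kind of tag is robust; a small object like a sheep gives the network no such dominant signal to latch onto.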
Because the network doesn’t actually form a concept of sheep, or determine which section of pixels makes up a sheep, it’s easily fooled. It only has some measure by which it guesses at their presence or absence, probably a sort of texture, as mentioned in Shane’s post. So the pixels making up the wool might be the key to predicting a sheep, for example. Of course, NNs can recognize lots of image features, such as lines, edges, curves, fills, etc. But it’s not the same kind of recognition a human does, and it leaves AIs vulnerable to pranks, such as the sheep-in-funny-places test.
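To gesture at what a texture cue might look like, here’s a toy roughness measure: a speckled, high-contrast patch (wool-like) scores higher than a flat one. The grids and the measure itself are invented for illustration, not anything from Shane’s post:

```python
# Sketch of the texture idea: mean absolute brightness difference between
# horizontal neighbours. The patches below are invented toy data.

def roughness(grid):
    """Mean absolute brightness difference between horizontal neighbours."""
    diffs = [abs(row[i] - row[i + 1]) for row in grid for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

wool = [[230, 180, 240, 170, 235, 175] for _ in range(4)]   # speckled patch
grass = [[90, 92, 91, 90, 93, 91] for _ in range(4)]        # flat patch

print(roughness(wool) > roughness(grass))  # True -- "woolly" wins
```

A cue like this fires on anything woolly-looking in a sheep-ish context, which is exactly why relocating the sheep breaks it.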
I admit to over-simplifying my explanations of the technical aspects a bit. I could go into a lecture about how NNs work in general and for image recognition in particular, but it would be a bit long for this post, and in many cases no one, not even a system’s designers, fully understands how it makes its decisions. It is possible to design or train them more transparently, but most people don’t.
But even poor design has its benefits, such as answering this long-standing question for us!
If anyone feels I’ve made any technical or logical errors in my analysis, I’d love to hear about it, inasmuch as learning new things is always nice.