Predicting unknown classes with “Visual to Semantic” transfer. Applications for general AI.

Machine learning loves big data, especially when it’s labelled. Google and Tencent have released image datasets consisting of millions and tens of millions of training examples. OpenAI showed that simply scaling up the dataset and the number of network parameters by a factor of 10 makes a network behave on a new level.

That’s not how the human brain works. Not only are we able to learn from a few examples, we are also capable of cross-domain learning. If somebody explains to you what a tiger is and what it looks like, you will recognize it even if you have never seen one before. We can even recognize a smell from a verbal description.

How can we predict previously unseen and unknown classes?

Cross-domain interpretability exists in biological brains. This level of abstraction goes deeper than each sensory channel.

Based on this logic, I came up with the idea of a neural net that operates on an inter-sensory basis. After some research, I found that (obviously) this isn’t a new approach: it is used in so-called ZSL (zero-shot learning). It allows an (almost) unlimited number of classes in predictions.


The idea is simple: map Image Embeddings from an image classification network into Word Embeddings, making a bridge between visual and semantic representations.
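Once an image has been mapped into word-embedding space, a prediction can be read off by looking for the nearest word vectors. The following is a minimal sketch of that lookup using cosine similarity. The helper names and the 3-d toy vectors are invented for illustration (the word vectors in this post are 300-d), so treat it as a picture of the mechanism rather than the code used in the experiment.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(predicted, vocab, k=3):
    """Return the k vocabulary words whose vectors are closest to `predicted`."""
    scored = [(cosine_similarity(predicted, vec), word) for word, vec in vocab.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Toy vectors, made up for this example only.
toy_vocab = {
    "tiger": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "guitar": [0.0, 0.2, 0.9],
    "piano": [0.1, 0.1, 0.8],
}

predicted = [0.9, 0.15, 0.0]  # stand-in for the output of the image-to-word network
top = nearest_words(predicted, toy_vocab, k=2)
print(top)
```

Because the lookup runs over whatever vocabulary is supplied at inference time, the candidate words do not have to be among the class names seen during training.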

In theory, this bridge solves the problem of a limited number of classes during training, because modern Word Embeddings (fastText first of all) are not limited to a fixed vocabulary and can operate on the full alphanumeric space of the English language.
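The reason fastText can handle words outside its vocabulary is that it builds a word vector from the vectors of the word's character n-grams. Below is a minimal sketch of the n-gram extraction step only. The function name is hypothetical, and the real library additionally hashes n-grams into a fixed number of buckets and sums their vectors, which is omitted here. The range of 3 to 6 characters matches, to my knowledge, the default of fastText's unsupervised models.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# A shorter range keeps the output readable.
grams = char_ngrams("tiger", min_n=3, max_n=4)
print(grams)
```

Any string, including a misspelling or a class name never seen before, decomposes into such n-grams, so it still receives a vector.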

Visual to Semantic transfer architecture.

Technical details

To test the idea, I used the following protocol:

  • 42620 images from the Open Images V2 validation set were processed with a ResNet101 pre-trained on the Open Images dataset, and an embedding vector (after pool5) was extracted for each image.
  • The class name corresponding to each image was lowercased, and an embedding vector [1x300d] was obtained from the word-embedding model.
  • Image Embeddings were mapped to Word Embeddings using a 4-layer fully connected network with dropout. The network was trained for 2000 epochs with cosine distance as the loss function, and the best checkpoint was chosen based on a validation subset.
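For reference, the cosine-distance objective from the last step can be written out as below. The symbols are introduced here for illustration and do not come from the original code: f_θ is the mapping network, x_i an image embedding, w_i the word embedding of its class name, and N the batch size. Averaging over the batch is an assumption, matching the default reduction of PyTorch's CosineEmbeddingLoss with a target of +1.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N}
  \left( 1 - \frac{f_\theta(x_i) \cdot w_i}
                  {\lVert f_\theta(x_i) \rVert \, \lVert w_i \rVert} \right)
```

The loss is 0 when every predicted vector points in the same direction as its target and 2 when it points the opposite way. Vector length is ignored.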


To give you a general sense of the results, here are a few cherry-picked examples. Each image is classified both with ResNet101 and with Image to Semantic transfer. For a fair comparison, I also show the words whose vectors are most similar to the class name.

Obviously, the Semantic prediction gives a different angle of view.

The main result I am trying to show is that a richer and more varied “tag cloud” can be extracted from the image with this approach. So it is important to show a substantial difference between simply mapping the class name to similar words and Image to Semantic transfer.

While missing the music part, the semantic predictions focus mostly on the human and … some body parts.
Again, here we see a much more detailed view, grasping parts that were missed or not labelled in the “classical” classification problem.

Further ideas

This simple experiment sparks a lot of further thoughts and ideas. What about a reverse, word-to-image transfer? Generating images with a GAN from typed text? These are hard to train but simple to understand.

But we can go a step further here. Think about how this architecture relates to the nature of consciousness. Human cognition relies heavily on cross-domain transfer, constantly double-checking the nature of reality by comparing “abstract embeddings” to find the ground truth. I think that Tesla’s new FSD works in the same way: all neural networks merge into a single ground-truth vector describing the car’s surroundings.

NN architecture

from torch import nn


class ImgToWordModel(nn.Module):
    """Maps a 2048-d ResNet101 pool5 embedding to a 300-d word embedding."""

    def __init__(self):
        super().__init__()

        def fc_block(in_features, out_features):
            # Linear -> LeakyReLU -> Dropout
            return [
                nn.Linear(in_features, out_features),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Dropout(0.1),
            ]

        self.model = nn.Sequential(
            *fc_block(2048, 2048 * 3),
            *fc_block(2048 * 3, 2048 * 2),
            *fc_block(2048 * 2, 1024),
        )
        # Final projection into word-embedding space, with no activation or dropout.
        self.last_layer = nn.Sequential(nn.Linear(1024, 300))

    def forward(self, img):
        out = self.model(img)
        word = self.last_layer(out)
        return word

Everything else is straightforward. The inference code for ResNet101 was taken from Open Images, and the training setup (2000 epochs) is as follows:

loss_fn = torch.nn.CosineEmbeddingLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Target of +1 for each of the 256 pairs in a batch, so the loss is 1 - cos(pred, target).
flags = torch.ones(256, device=device)
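To show how these pieces fit together, here is a minimal sketch of a single training step. It is an illustration under stated assumptions, not the original training script: a small stand-in network replaces ImgToWordModel to keep the example light, random tensors stand in for one batch of image and word embeddings, and the `train_step` helper is a name introduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the mapping network (assumption: same input and output sizes
# as ImgToWordModel, but far fewer parameters).
model = nn.Sequential(
    nn.Linear(2048, 1024),
    nn.LeakyReLU(0.2),
    nn.Linear(1024, 300),
).to(device)

loss_fn = nn.CosineEmbeddingLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random data standing in for one batch of pool5 features and word vectors.
img_emb = torch.randn(256, 2048, device=device)
word_emb = torch.randn(256, 300, device=device)
flags = torch.ones(img_emb.size(0), device=device)  # +1: pull each pair together


def train_step(img_batch, word_batch, targets):
    """Run one optimizer update and return the batch loss as a float."""
    model.train()
    optimizer.zero_grad()
    pred = model(img_batch)
    loss = loss_fn(pred, word_batch, targets)  # mean of 1 - cos(pred, target)
    loss.backward()
    optimizer.step()
    return loss.item()


losses = [train_step(img_emb, word_emb, flags) for _ in range(5)]
print(losses)
```

Sizing `flags` from the batch rather than hard-coding 256 avoids a shape mismatch on a final, smaller batch.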

I will definitely write another publication about the philosophical aspects of this idea. Stay tuned!

Oleksandr is currently working as a Senior Research Engineer at Ring. Previously, he led an AI laboratory at Skylum Software and founded the Let's and Titanovo startups. He does research in image processing, marketing intelligence, and an assorted number of things with the power of machine learning.
