Sunday, September 14, 2014

Supervised Learning

  • Convolutional Networks (MNIST) [I]
    • Handwritten digit recognition with a back-propagation network, Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel (NIPS 1989) [PDF]
  • AlexNet (ImageNet Challenge) [I]
    • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS 2012. [PDF]
  • Visualizing Deep Networks [D]

12 comments:

  1. I think the question of whether things are a distributed code or grandmother cells is a quite interesting topic for discussion.

    Another paper to read in addition to the Zeiler one: http://www.cs.berkeley.edu/~rbg/papers/cnn-analysis.pdf
    Analyzing the Performance of Multilayer Neural Networks for Object Recognition. Pulkit Agrawal, Ross Girshick, Jitendra Malik. ECCV 14.

    Specifically, Section 4 suggests that there are grandmother-like cells for only a few classes and argues that the top-K visualizations are somewhat misleading in suggesting the presence of highly specific object detectors (e.g., the dog faces in Figure 2 of the Zeiler paper). It further suggests that the elements form a distributed code of sorts.

    I'm not entirely convinced of this argument -- whether things are a distributed code or grandmother cell depends really on how you break down the world into classes. For instance, is a wheel (Fig 2, layer 5, last row) a grandmother cell or not? It's shared by wheel-chair, old-style cars, and unicycles, but it's also a class of its own.

    Replies
    1. I think this is a false dichotomy. If the representation is too distributed, then it's going to be difficult to bin the data into distinct "things." If the representation is too sparse, then it's not going to handle variability and noise well. What I see here is a very gradual transition from an absolutely distributed representation (pixels) to representations which are still somewhat distributed but more invariant and are useful for the task at hand.

      Would it be incorrect to say that this is an iterative, HOG-like process where the input gradients are chosen through supervision? HOG over HOG over HOG ..., albeit gradually.

      In other words (a rough code sketch follows these steps):

      1. Take data inputs, specifically their variation (gradients), and let the task at hand (classification) guide which gradients to emphasize.

      2. Pool them in order to achieve a level of invariance

      3. Normalize the inputs.

      4. Repeat the process on the outputs.
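
      To make the four steps concrete, here is a rough numpy sketch of one such stage repeated a few times. The filter values are random placeholders rather than learned weights, so only the structure (filter -> rectify -> pool -> normalize -> repeat), not the numbers, is meaningful; in a real CNN the kernels are learned by back-propagation, which is where the supervision in step 1 enters.

      import numpy as np

      def conv2d(image, kernel):
          # Valid 2-D cross-correlation of a single-channel image with one filter.
          kh, kw = kernel.shape
          h, w = image.shape
          out = np.zeros((h - kh + 1, w - kw + 1))
          for i in range(out.shape[0]):
              for j in range(out.shape[1]):
                  out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
          return out

      def stage(x, kernel, pool=2):
          # One "HOG-like" stage: filter responses -> rectify -> max-pool -> normalize.
          r = np.maximum(conv2d(x, kernel), 0.0)
          h, w = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
          r = r[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
          return (r - r.mean()) / (r.std() + 1e-8)

      rng = np.random.default_rng(0)
      x = rng.standard_normal((32, 32))        # stand-in for an input image
      for layer in range(3):                   # step 4: repeat the process on the outputs
          x = stage(x, rng.standard_normal((3, 3)))
          print("after stage", layer + 1, "shape:", x.shape)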

      I think it would be interesting to see, given infinite data, where the power of the representation saturates as the number of layers increases.

      This paper http://arxiv.org/pdf/1409.1556v2.pdf shows that improvements can be made even with 16 layers!

  2. I suggest also reading the paper - http://yann.lecun.com/exdb/publis/psgz/lecun-98.ps.gz
    This outlines the history of conv nets from the 1960s and goes into a lot of detail about the design decisions and architectural choices. It tries to give many motivations for these choices, though not all of them are convincing.
    I suggest going through mainly Section 2 of the paper.

  3. I generally hear that we could not train neural networks effectively in the past because datasets were too small. Do we really need ImageNet to do this task? I mean, YouTube has 150M videos; can we not sample frames from some of those videos instead?

    Replies
    1. The frames sampled from a YouTube video aren't labelled, or at best are poorly labelled.

    2. But couldn't those frames still be used to learn the initial layers, after which we could fine-tune on some specialized dataset for the task we actually need to do?
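
      As a rough illustration of that workflow (written against PyTorch/torchvision names purely for the sketch; neither library is discussed in this thread): keep the pre-trained convolutional layers fixed and re-learn only a new classifier head on the small, task-specific dataset.

      import torch
      import torch.nn as nn
      from torchvision import models

      model = models.alexnet(weights=None)        # pretend these weights came from large-scale pre-training
      for p in model.features.parameters():       # freeze the early convolutional layers
          p.requires_grad = False
      model.classifier[-1] = nn.Linear(4096, 20)  # new head for a hypothetical 20-class target task

      optimizer = torch.optim.SGD(
          [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
      criterion = nn.CrossEntropyLoss()
      # ...then train as usual on the small labelled dataset; only the unfrozen classifier
      # parameters are updated, so the pre-trained layers act as a fixed feature encoder.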

    3. My impression is that if you could do really great unsupervised pre-training and learn a powerful representation and then train on a small dataset, you'd have a Marr prize and a lot of money. Basically: nobody knows how to get really good representations.

      I'm not entirely convinced by the unsupervised story of simply looking at random images and learning a really great representation. Unsupervised by humans or semantics, sure, but I think there has to be some signal of some variety to learn something interesting. This could be time, context (see Carl's paper at ECCV '14 for a very interesting idea), motion, interaction, physics, etc.

  4. I feel that CNNs are considerably hacked to give the best results. There seems to be little generality in the approach taken by AlexNet. Of course it gives a great result on ImageNet, but there seems to be little guarantee that it would work on a different dataset.
    So how does one choose the best kernel size for each convolutional layer, or the optimal parameters for data augmentation?
    Isn't there a huge dependence of the optimal parameters on the architecture?

    Replies
    1. The paper mentions choosing parameters using validation data.
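
      As a minimal sketch of that kind of selection (the candidate settings and the train_and_score function here are hypothetical stand-ins for the real training code): train once per setting and keep whichever setting scores best on a held-out validation split.

      def select_by_validation(candidates, train_and_score):
          # candidates: list of hyper-parameter dicts, e.g. {"kernel_size": 3, "crop_jitter": 4}
          # train_and_score: trains a model with the given settings, returns validation accuracy
          best, best_score = None, float("-inf")
          for params in candidates:
              score = train_and_score(params)
              if score > best_score:
                  best, best_score = params, score
          return best, best_score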

    2. I think one reason why it looks that way is that all the parameters, from features, to mid-level, to the final classifiers, are exposed. In comparison with "extract superpixels, do dense SIFT, Fisher-vectorize, throw it into a linear SVM", this seems like a lot of parameters. But there are tons of parameters that were just tuned by other people. For instance, if you take a look at the HOG paper again, there was serious brute-force parameter tuning:

      http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf

      The same sorts of normalization questions are actually really important for SIFT too -- if you don't do the right normalizations in the right order, it really doesn't work (from what I've been told).

      -------

      As for generalization, what we've seen is that they actually do work quite well across datasets -- I'll cover some experiments in class today.

  5. I think someone (perhaps Krishna?) asked today whether it's possible to somehow leverage the effect that "greying out" certain regions in an image has on a classification network. I just saw a paper that seems to use exactly this to learn how to localise objects without bounding box annotations. http://arxiv.org/abs/1409.3964

    Replies
    1. Yeah, I was wondering along similar lines: to get the top 5 classes, would one get better results by occluding the region corresponding to the highest-scoring class and then sequentially occluding more of the image to surface the next most prominent class?
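
      A toy sketch of the basic occlusion measurement (classify here is a stand-in for any trained network's forward pass, not a specific API): slide a grey patch over the image and record how much the top class's score drops at each position.

      import numpy as np

      def occlusion_map(image, classify, patch=8, stride=4, fill=0.5):
          # classify(image) -> 1-D array of class scores
          scores = classify(image)
          top = int(np.argmax(scores))
          h, w = image.shape[:2]
          heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
          for i in range(heat.shape[0]):
              for j in range(heat.shape[1]):
                  occluded = image.copy()
                  occluded[i * stride:i * stride + patch, j * stride:j * stride + patch] = fill
                  heat[i, j] = scores[top] - classify(occluded)[top]  # drop in the top class's score
          return top, heat

      Regions with the largest drop are the ones the network relies on for its current top prediction; the sequential idea above would then grey out those regions and re-run the classifier to see what rises to the top next.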
