Computer Vision talks at ICML 2018

The 35th International Conference on Machine Learning (ICML) was held in Stockholm on July 10-15, 2018.  Links to the associated papers and video recordings (when they are posted) will be available on the website under each individual session. 


Overall, deep learning and reinforcement learning were still the hottest topics.  During each session slot, there was a reinforcement learning track and at least one deep learning (neural network architectures) track (and sometimes multiple deep learning tracks).  Here is the breakdown of submitted and accepted papers, which clearly shows the popularity of these topics. 


Talks on the fairness of machine learning algorithms continue to increase.  One of the "best paper" awards went to “Delayed Impact of Fair Machine Learning”.  This interesting paper compared two fairness criteria, demographic parity and equality of opportunity, against unconstrained utility maximization, and introduced the “outcome curve”, a tool for comparing the delayed impact of fairness criteria.  The authors showed that, once long-term effects are considered, fairness criteria may harm the very groups they were intended to protect.
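
To make the two fairness criteria concrete, here is a minimal sketch, assuming binary decisions and two groups.  The function names and the toy data are illustrative assumptions, not taken from the paper, and the sketch measures only the static gaps, not the delayed impact that the paper studies.

```python
def selection_rate(decisions):
    """Fraction of individuals who received the positive decision."""
    return sum(decisions) / len(decisions)


def true_positive_rate(decisions, outcomes):
    """Fraction of qualified individuals (outcome == 1) who were selected."""
    qualified = [d for d, y in zip(decisions, outcomes) if y == 1]
    return sum(qualified) / len(qualified)


def demographic_parity_gap(dec_a, dec_b):
    """Demographic parity asks for equal selection rates across groups."""
    return abs(selection_rate(dec_a) - selection_rate(dec_b))


def equal_opportunity_gap(dec_a, out_a, dec_b, out_b):
    """Equality of opportunity asks for equal true positive rates across groups."""
    return abs(true_positive_rate(dec_a, out_a) - true_positive_rate(dec_b, out_b))


# Hypothetical toy data: decisions (1 = selected) and outcomes (1 = qualified).
dec_a, out_a = [1, 1, 0, 0], [1, 1, 1, 0]
dec_b, out_b = [1, 0, 0, 0], [1, 0, 0, 0]

dp_gap = demographic_parity_gap(dec_a, dec_b)
eo_gap = equal_opportunity_gap(dec_a, out_a, dec_b, out_b)
print(dp_gap, eo_gap)
```

A gap of zero means the corresponding criterion is satisfied exactly; the two gaps can disagree, which is one reason the choice of criterion matters.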


There was a decent amount of work on adversarial attacks and defenses.  The first keynote, on AI & Security, and the other “best paper” award were on this topic, as were some tracks.  The “best paper” winner was “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”.  This paper examined the non-certified white-box-secure defenses against adversarial examples from ICLR 2018 and found that 7 of the 9 defenses relied on obfuscated gradients.  The authors developed 3 attack techniques, which circumvented 6 of those defenses completely and 1 partially.  From this, they argued that future work should avoid relying on obfuscated gradients, and they also spoke about the importance of reevaluating others’ published results.

Of particular interest to me this year were the Computer Vision talks.  Here is a quick summary of some of the innovations in Computer Vision.  There were 2 computer vision tracks, on Wed July 11 and on Fri July 13.

Deep Predictive Coding Network for Object Recognition (link to paper)
They described a bi-directional, recurrent neural network, the deep predictive coding network (PCN), which has feedforward, feedback, and recurrent connections. Feedback connections from a higher layer carry the prediction of the lower layer’s representation; feedforward connections carry the prediction errors up to the higher layer. Given an input image, PCN runs recursive cycles of bottom-up and top-down computation to update its internal representations and reduce the difference between bottom-up input and top-down prediction at every layer. After multiple cycles of recursive updating, the representation is used for image classification. On benchmark datasets (CIFAR-10/100, SVHN, and MNIST), PCN always outperformed its feedforward-only counterpart (a model without any mechanism for recurrent dynamics), and its performance tended to improve with more cycles of computation. In short, PCN reuses a single architecture to recursively run bottom-up and top-down processes, refining its representation towards more accurate and definitive object recognition.
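
The cycle of top-down prediction and bottom-up error correction can be pictured with a toy example.  This is only a sketch under strong assumptions: a single layer, fixed linear weights, and no convolutions or learning, so it is not the authors' architecture.  The names `predictive_coding_cycles`, `W`, and `step` are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def predictive_coding_cycles(x, W, cycles=20, step=0.1):
    """Refine a representation r so that the top-down prediction W r matches x.

    x: input vector (length n)
    W: n x k feedback weights mapping the representation to a predicted input
    Returns the final representation and the squared error seen at each cycle.
    """
    r = [0.0] * len(W[0])
    W_t = transpose(W)
    errors = []
    for _ in range(cycles):
        prediction = matvec(W, r)                          # feedback: predict the input
        err = [xi - pi for xi, pi in zip(x, prediction)]   # prediction error
        errors.append(sum(e * e for e in err))
        correction = matvec(W_t, err)                      # feedforward: send the error up
        r = [ri + step * ci for ri, ci in zip(r, correction)]
    return r, errors


# Hypothetical weights and input; x equals W applied to [1, 2].
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0, 3.0]
r, errors = predictive_coding_cycles(x, W)
print(errors[0], errors[-1])
```

In this toy setting the prediction error shrinks with every cycle, which mirrors the observation that more cycles of computation tended to help.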

Gradually Updated Neural Networks for Large-Scale Image Recognition (link to paper)
Neural networks keep getting deeper, traditionally by cascading convolutional layers or building blocks.  They present a new way to increase depth: introducing computational orderings to the channels within convolutional layers or blocks, so that the channels are updated gradually rather than all at once.  This not only increases the depth and learning capacity at the same computational cost and memory, but also eliminates overlap singularities, resulting in faster convergence and better performance.  They call the resulting networks Gradually Updated Neural Networks (GUNN).
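
One way to picture the channel ordering is to contrast a simultaneous update with a group-by-group update.  The sketch below is a loose illustration under my own assumptions: `toy_channel_fn` is a hypothetical stand-in for a convolution, and the grouping scheme is not taken from the paper.

```python
def simultaneous_update(x, channel_fn):
    """Conventional layer: every output channel is computed from the same input."""
    return [channel_fn(x, c) for c in range(len(x))]


def gradual_update(x, channel_fn, group_size):
    """Update channels one group at a time; later groups see earlier results."""
    x = list(x)
    for start in range(0, len(x), group_size):
        group = range(start, min(start + group_size, len(x)))
        new_values = [channel_fn(x, c) for c in group]  # uses the current state
        for c, value in zip(group, new_values):
            x[c] = value
    return x


def toy_channel_fn(x, c):
    """Hypothetical stand-in for a convolution: each channel sums all channels."""
    return sum(x)


x = [1.0, 1.0, 1.0, 1.0]
print(simultaneous_update(x, toy_channel_fn))
print(gradual_update(x, toy_channel_fn, group_size=2))
```

In the gradual version the second group is computed from channels that have already been updated, so one pass behaves like a deeper computation while touching each channel only once.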

Neural Inverse Rendering for General Reflectance Photometric Stereo (link to paper)
Photometric stereo is the problem of recovering 3D object surface normals from multiple images observed under varying illuminations.  They propose a physics-based unsupervised learning approach to general BRDF photometric stereo, where surface normals and BRDFs are predicted by the network and fed into the rendering equation to synthesize the observed images.  This learning process doesn’t require ground-truth normals; using physics bypasses the lack of training data.  Their state-of-the-art results outperformed a supervised DNN and other classical unsupervised methods.
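
For intuition about the underlying problem, here is a minimal sketch of classical Lambertian photometric stereo for a single pixel.  It assumes three known light directions, no shadows, and a purely diffuse surface, which is a much simpler setting than the general BRDF case the paper addresses.  All names and values are illustrative.

```python
import math


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    solution = []
    for col in range(3):
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def render_lambertian(normal, albedo, lights):
    """Forward model: intensity = albedo * max(0, light . normal)."""
    return [albedo * max(0.0, sum(l * n for l, n in zip(light, normal)))
            for light in lights]


def recover_normal(intensities, lights):
    """Invert the forward model: solve L g = I, where g = albedo * normal."""
    g = solve3(lights, intensities)
    albedo = math.sqrt(sum(v * v for v in g))
    return [v / albedo for v in g], albedo


# Hypothetical unit light directions and a hypothetical surface point.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
observed = render_lambertian([0.0, 0.6, 0.8], 0.5, lights)
normal, albedo = recover_normal(observed, lights)
print(normal, albedo)
```

The paper's approach replaces this closed-form inversion with a network whose predicted normals and BRDFs must re-render the observed images, which is what removes the need for ground-truth normals.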

One-Shot Segmentation in Clutter (link to paper)
This was an interesting look at visual search.  They cited “Where’s Waldo?” as a fun example of solving a problem with only one example.  :)  They tackled the problem of one-shot segmentation: finding and segmenting a previously unseen object in a cluttered scene based on a single instruction example.  The MNIST of one-shot learning is the Omniglot dataset, and they proposed a novel dataset called “cluttered Omniglot”, which uses all the characters but drops them on top of each other in different colors.  Using an architecture that combines a Siamese embedding for detection with a U-net for segmentation, they show that increasing levels of clutter make the task progressively harder.  In this kind of visual search task, detection and segmentation are two intertwined problems, and solving either one helps solve the other.  They also tried pre-segmenting the characters: after segmenting by color, performance became very good.  They introduced MaskNet, an improved model that attends to multiple candidate locations, generates segmentation proposals to mask out background clutter, and selects among the segmented objects (segment first, decide later).  Such image recognition models, based on an iterative refinement of object detection and foreground segmentation, may provide a way to deal with highly cluttered scenes.

Active Testing: An Efficient and Robust Framework for Estimating Accuracy (link to paper)
Supervised learning is hungry for annotated data.  There are many approaches for dealing with the lack of labelled data in training (unsupervised, semi-supervised, etc.), but testing still typically relies on assembling a small, high-quality test dataset.  The gold standard: given a fixed budget, annotate "all we can afford".  Their approach: trade annotation accuracy for more examples.  They reformulate the problem as one of active testing, and examine strategies for efficiently querying a user so as to obtain an accurate performance estimate with minimal vetting.  They demonstrate the effectiveness of their proposed active testing framework on estimating two performance metrics, Precision@K and mean Average Precision, for two popular Computer Vision tasks, multilabel classification and instance segmentation, respectively.  They further show that their approach significantly reduces human annotation effort and is more robust than alternative evaluation protocols.
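
As a reminder of what is being estimated, here is a small sketch of Precision@K computed from noisy labels, and again after a few labels have been vetted by a human.  Overwriting noisy labels with vetted ones is only a simple baseline that I assume for illustration; it is not the estimator proposed in the paper, and the data is made up.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring items whose label is positive."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def merge_vetted(noisy_labels, vetted):
    """Replace noisy labels with vetted ones where available.

    vetted: dict mapping item index -> human-verified label
    """
    return [vetted.get(i, label) for i, label in enumerate(noisy_labels)]


# Hypothetical classifier scores and noisy annotations for five images.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
noisy_labels = [1, 1, 0, 1, 0]
vetted = {1: 0}  # a human checked item 1 and found the noisy label was wrong

noisy_estimate = precision_at_k(scores, noisy_labels, k=3)
vetted_estimate = precision_at_k(scores, merge_vetted(noisy_labels, vetted), k=3)
print(noisy_estimate, vetted_estimate)
```

Vetting a single highly ranked item changes the estimate noticeably here, which hints at why choosing which items to vet matters under a fixed budget.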

Noise2Noise: Learning Image Restoration without Clean Data (link to paper)
This is interesting for those working with grainy images.  They train a denoiser.  They apply basic statistical reasoning to signal reconstruction by machine learning - learning to map corrupted observations to clean signals - with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, with performance matching and sometimes exceeding that of training on clean data, without explicit image priors or likelihood models of the corruption.  In practice, they show that a single model learns photographic noise removal, denoising of synthetic Monte Carlo images, and reconstruction of undersampled MRI scans - all corrupted by different processes - based on noisy data only.
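
The statistical reasoning can be illustrated with a single pixel.  The value that minimizes squared error against a set of targets is their mean, so if the noise is zero-mean, noisy targets pull the estimate towards the clean value.  The sketch below assumes Gaussian noise and a constant estimate rather than a trained network, and every name in it is illustrative.

```python
import random


def l2_optimal_estimate(targets):
    """The z minimizing sum((z - y) ** 2) over the targets y is their mean."""
    return sum(targets) / len(targets)


def squared_loss(z, targets):
    return sum((z - y) ** 2 for y in targets) / len(targets)


rng = random.Random(0)  # fixed seed so the run is repeatable
clean_pixel = 0.7       # hypothetical clean value that is never shown to the estimator
noisy_targets = [clean_pixel + rng.gauss(0.0, 0.2) for _ in range(20000)]

estimate = l2_optimal_estimate(noisy_targets)
print(abs(estimate - clean_pixel))
```

With enough noisy targets the estimate lands close to the clean value even though no clean target was ever used, which is the intuition behind training on corrupted pairs.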

Solving Partial Assignment Problems using Random Clique Complexes (link to paper)
This is an interesting paper to read if you need to perform a matching task, such as finding the same object (like a building) across images despite occlusion, rotation, etc.  They present an alternate formulation of the partial assignment problem as matching random clique complexes, which are higher-order analogues of random graphs designed to provide a set of invariants that better detect higher-order structure. The proposed method creates random clique adjacency matrices for each k-skeleton of the random clique complexes and matches them, treating each point as the affine combination of its geometric neighborhood.  They justify their solution theoretically by analyzing the runtime and storage complexity of their algorithm, along with the asymptotic behavior of the quadratic assignment problem (QAP) that is associated with the underlying random clique adjacency matrices.

Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data (link to paper)
Future predictions on sequence data (e.g., video or audio) require algorithms to capture non-Markovian and compositional properties of high-level semantics. Context-free grammars are natural choices to capture such properties, but traditional grammar parsers (e.g., the Earley parser) only take symbolic sentences as inputs. This paper generalizes the Earley parser to parse sequence data that is neither segmented nor labeled. This generalized Earley parser integrates a grammar parser with a classifier to find the optimal segmentation and labels, and makes top-down future predictions. Experiments show that this method significantly outperforms other approaches for future human activity prediction.

Neural Program Synthesis from Diverse Demonstration Videos (link to paper)
Interpreting decision-making logic in demonstration videos is key to collaborating with and mimicking humans.  For example, consider learning how to make fried rice from watching a bunch of YouTube videos; humans understand variations like brown or white rice, etc.  To empower machines with this ability, they propose a neural program synthesizer that is able to explicitly synthesize underlying programs from behaviorally diverse and visually complicated demonstration videos.  Their model uses 3 steps: extract unique behaviors (using CNNs feeding into an LSTM), summarize (compare demo pairs to infer branching conditions, using a multi-layer perceptron, to improve the network’s ability to integrate multiple demonstrations varying in behavior), and decode.  They also employ a multi-task objective to encourage the model to learn meaningful intermediate representations for end-to-end training.  They show that their model is able to reliably synthesize underlying programs as well as capture diverse behaviors exhibited in demonstrations.  Performance improved as the number of input videos increased.  The code is available at

Video Prediction with Appearance and Motion Conditions (link to paper)
Video prediction aims to generate realistic future frames by learning dynamic visual patterns. One fundamental challenge is to deal with future uncertainty: How should a model behave when there are multiple correct, equally probable futures? They propose an Appearance-Motion Conditional GAN to address this challenge. They provide appearance and motion information as conditions that specify how the future may look, reducing the level of uncertainty. Their model consists of a generator, two discriminators taking charge of appearance and motion pathways, and a perceptual ranking module that encourages videos of similar conditions to look similar. To train their model, they developed a novel conditioning scheme that consists of different combinations of appearance and motion conditions. They evaluate their model using facial expression and human action datasets – transforming input faces into different emotions with motion/video (generative videos).  They showed one interesting bug: Trump’s eyebrows turned black because no training data had white/blond eyebrows.  You can see this at, and the code is coming soon to