CNTK v2.2 Release Notes

Breaking change

  • This iteration requires cuDNN 6.0 in order to support dilated convolution and deterministic pooling. Please update your cuDNN.
  • This iteration requires OpenCV to support the TensorBoard image feature. Please install OpenCV before you install CNTK.

Documentation

Add HTML version of tutorials and manuals so that they can be searchable

We have added HTML versions of the tutorials and manuals with the Python documentation. This makes the tutorial notebooks and manuals searchable as well.

Updated evaluation documents

Documents related to model evaluation have been updated. Please check the latest documents here.

System

16bit support for training on Volta GPU (limited functionality)

This work has been rolled over to the next release due to a dependency on test infrastructure updates.

Support for NCCL 2

NCCL can now be used across machines. Users need to enable NCCL in the build configuration as described here. Note:

  • After installing the downloaded NCCL 2 package, there are two packages:
/var/nccl-repo-2.0.4-ga/libnccl2_2.0.4-1+cuda8.0_amd64.deb
/var/nccl-repo-2.0.4-ga/libnccl-dev_2.0.4-1+cuda8.0_amd64.deb

Install both of them to build CNTK with NCCL 2.

  • Due to issues in system configuration, users might encounter failures during NCCL initialization. To get detailed information about the failure, set the environment variable NCCL_DEBUG=INFO.
  • There are known issues in the current release of NCCL 2 on systems configured with InfiniBand devices running in mixed IB and IPoIB modes. To use only the devices running in IB mode, set the environment variable NCCL_IB_HCA to those devices, e.g.:
export NCCL_IB_HCA=mlx5_0,mlx5_2

CNTK learner interface update

This update simplifies the learner APIs and deprecates the concepts of UnitType.minibatch and UnitType.sample. The purpose is to make it intuitive to specify the learner hyper-parameters while preserving the unique model update technique in CNTK: the mean gradient of every N samples contributes approximately the same to the model updates regardless of the actual data minibatch size. A detailed explanation can be found in the manual How to Use CNTK Learners.

In the new API, all supported learners, including AdaDelta, AdaGrad, FSAdaGrad, Adam, MomentumSGD, Nesterov, RMSProp, and SGD, can now be specified by

cntk.<cntk_supporting_learner>(parameters=model.parameters,
    lr=<float or list>,
    [momentum=<float or list>], [variance_momentum=<float or list>],
    minibatch_size=<None, int, or cntk.learners.IGNORE>,
    ...other learner parameters)

There are two major changes:

  • lr: the learning rate schedule can be specified as a float, a list of floats, or a list of pairs (float, int) (see the parameter definition at learning_parameter_schedule). The same specification applies to the momentum and variance_momentum of the learners FSAdaGrad, Adam, MomentumSGD, and Nesterov, where such hyper-parameters are required.

  • minibatch_size: a minibatch_size can be specified to guarantee that the mean gradient of every N (minibatch_size=N) samples contributes to the model updates with the same learning rate, even if the actual minibatch size of the data differs from N. This is useful when the data minibatch size varies, especially when training with variable-length sequences and/or with uneven data partitions in distributed training.

    • If we set minibatch_size=cntk.learners.IGNORE, then we recover the behavior in the literature: the mean gradient of the whole minibatch contributes to the model update with the same learning rate. Ignoring the data minibatch size behaves the same as specifying a minibatch size for the learner when the data minibatch size equals the specified minibatch size.

With the new API:

  • To have model updates in the same manner as in the classic deep learning literature, we can specify the learner by setting minibatch_size=cntk.learners.IGNORE to ignore the minibatch size, e.g.
sgd_learner_m = C.sgd(z.parameters, lr = 0.5, minibatch_size = C.learners.IGNORE)

  • To enable the CNTK-specific technique that applies the same learning rate to the mean gradient of every N samples regardless of the actual minibatch size, we can specify the learner by setting minibatch_size=N, e.g. minibatch_size=2:
sgd_learner_s2 = C.sgd(z.parameters, lr = 0.5, minibatch_size = 2)
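
The difference between the two settings can be illustrated with plain arithmetic. The sketch below is not CNTK code. It assumes that applying the learning rate to the mean gradient of every N samples amounts to scaling the update for a minibatch of M samples by M/N; the helper effective_step and the IGNORE sentinel are hypothetical names used only for illustration.

```python
# Illustrative arithmetic only; not part of the CNTK API.
# Assumption: with minibatch_size=N, a minibatch of M samples counts as M/N
# groups of N samples, so its mean gradient is applied with lr * M / N.

IGNORE = object()  # hypothetical stand-in for cntk.learners.IGNORE


def effective_step(lr, mean_gradient, actual_size, minibatch_size=IGNORE):
    """Return the update applied for one minibatch under the stated assumption."""
    if minibatch_size is IGNORE:
        # Literature behavior: same learning rate whatever the minibatch size.
        return lr * mean_gradient
    return lr * (actual_size / minibatch_size) * mean_gradient


# A minibatch of 4 samples with mean gradient 1.0 and lr = 0.5:
step_ignore = effective_step(0.5, 1.0, actual_size=4)
step_n2 = effective_step(0.5, 1.0, actual_size=4, minibatch_size=2)
# When the actual size equals the specified size, both settings agree:
step_match = effective_step(0.5, 1.0, actual_size=2, minibatch_size=2)
```

Under this assumption, the four-sample minibatch is treated as two groups of two samples, so step_n2 is twice step_ignore, while step_match equals step_ignore.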

Regarding the momentum_schedule of the learners FSAdaGrad, Adam, MomentumSGD, and Nesterov, it can be specified in a similar way. Let's use momentum_sgd as an example:

momentum_sgd(parameters, lr=float or list of floats, momentum=float or list of floats,
             minibatch_size=C.learners.IGNORE, epoch_size=epoch_size)
momentum_sgd(parameters, lr=float or list of floats, momentum=float or list of floats,
             minibatch_size=N, epoch_size=epoch_size)

As with learning_rate_schedule, the arguments are interpreted as follows:

  • With minibatch_size=C.learners.IGNORE, the decay momentum=beta is applied to the mean gradient of the whole minibatch regardless of its size. For example, whether the minibatch size is N or 2N (or any other size), the mean gradient of such a minibatch will have the same decay factor beta.

  • With minibatch_size=N, the decay momentum=beta is applied to the mean gradient of every N samples. For example, minibatches of sizes N, 2N, 3N and kN will have decays of beta, pow(beta, 2), pow(beta, 3) and pow(beta, k) respectively --- the decay is exponential in the proportion of the actual minibatch size to the specified minibatch size.
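
The decay rule described in the two bullets above can be written out as a small calculation. This is a sketch in plain Python, not CNTK code; effective_momentum and the IGNORE sentinel are hypothetical names.

```python
# Illustrative arithmetic only; not part of the CNTK API.

IGNORE = object()  # hypothetical stand-in for C.learners.IGNORE


def effective_momentum(beta, actual_size, minibatch_size=IGNORE):
    """Decay factor applied for one minibatch of `actual_size` samples."""
    if minibatch_size is IGNORE:
        # Same decay for the whole minibatch, whatever its size.
        return beta
    # Decay is exponential in the ratio of actual to specified minibatch size.
    return beta ** (actual_size / minibatch_size)


N = 32
# Minibatches of sizes N, 2N and 3N with beta = 0.9 and minibatch_size=N:
decays = [effective_momentum(0.9, k * N, minibatch_size=N) for k in (1, 2, 3)]
# With IGNORE, a minibatch of size 2N still decays by beta:
decay_ignore = effective_momentum(0.9, 2 * N)
```

Here decays holds beta, pow(beta, 2) and pow(beta, 3) up to floating-point rounding, matching the description above.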

A C#/.NET API that enables people to build and train networks.

Training Support Is Added To C#/.NET API.

With this addition to the existing CNTK C# Evaluation API, .NET developers can enjoy a fully integrated deep learning experience. A deep neural network can be built, trained, and validated fully in C# while still taking advantage of CNTK's performance strengths. Users may debug directly into the CNTK source code to see how a DNN is trained and evaluated. New features include:

Basic C# Training API.

Over 100 basic functions are supported to build a computation network. These functions include Sigmoid, Tanh, ReLU, Plus, Minus, Convolution, Pooling, BatchNormalization, to name a few.

As an example, to build a logistic regression loss function:

Function z = CNTKLib.Times(weightParam, input) + biasParam;
Function loss = CNTKLib.CrossEntropyWithSoftmax(z, labelVariable);

CNTK Function As A Primitive Element To Build A DNN

A DNN is built through basic operation composition. For example, to build a ResNet node:

Function conv = CNTKLib.Pooling(CNTKLib.Convolution(convParam, input),
                                PoolingType.Average, poolingWindowShape);
Function resNetNode = CNTKLib.ReLU(CNTKLib.Plus(conv, input));

Batching Support

We provide the MinibatchSource and MinibatchData utilities to help with efficient data loading and batching.

Training Support

We support many stochastic gradient descent optimizers commonly seen in the DNN literature: MomentumSGDLearner, AdamLearner, AdaGradLearner, etc. For example, to train a model with an Adam stochastic optimizer:

var parameterLearners = new List<Learner>() { Learner.AdamLearner(classifierOutput.Parameters(),
                                                                  learningRate, momentum) };
var trainer = Trainer.CreateTrainer(classifierOutput, trainingLoss,
                                    prediction, parameterLearners);

Training examples cover a broad range of DNN use cases:

R-binding for CNTK

R-binding for CNTK, which enables both training and evaluation, will be published in a separate repository very soon.

Examples

Object Detection with Fast R-CNN and Faster R-CNN

New C++ Eval Examples

We added the new C++ examples CNTKLibraryCPPEvalCPUOnlyExamples and CNTKLibraryCPPEvalGPUExamples. They illustrate how to use the C++ CNTK Library for model evaluation on CPU and GPU. Another new example is UWPImageRecognition, which uses the CNTK UWP library for model evaluation.

New C# Eval examples

We added an example for asynchronous evaluation: EvaluationSingleImageAsync(). Note that the CNTK C# API does not have an asynchronous method for Evaluate(), because evaluation is a CPU-bound operation (please refer to this article for a detailed explanation). However, running evaluation asynchronously is desirable in some use cases, e.g. offloading for responsiveness, so the example EvaluationSingleImageAsync() shows how to achieve that by using the extension method EvaluateAsync(). Please refer to the section Run evaluation asynchronously on the page Using C#/.NET Managed API for details.

Operations

Noise contrastive estimation node

This provides a built-in, efficient (but approximate) loss function used to train networks when the number of classes is very large. For example, you can use it when you want to predict the next word out of a vocabulary of tens or hundreds of thousands of words.

To use it, define your loss as:

loss = nce_loss(weights, biases, inputs, labels, noise_distribution)

Once you are done training, you can make predictions like this:

logits = C.times(weights, C.reshape(inputs, (1,), 1)) + biases

Note that the noise contrastive estimation loss cannot help with reducing inference costs; the cost savings are only during training.

Improved AttentionModel

A bug in our AttentionModel layer has been fixed and we now faithfully implement the paper

Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al.)

Furthermore, the arguments attention_span and attention_axis of the AttentionModel have been deprecated. They should be left to their default values, in which case the attention is computed over the whole sequence and the output is a sequence of vectors of the same dimension as the first argument over the axis of the second argument. This also leads to substantial speed gains (our CNTK 204 Tutorial now runs more than 2x faster).

Aggregation on sparse gradient for embedded layer

This change saves the costly conversion from sparse to dense before gradient aggregation when the embedding vocabulary size is huge. It is currently enabled for the GPU build when training on GPU with non-quantized data parallel SGD. For other distributed learners and the CPU build, it is disabled by default. It can be manually turned off in Python by calling cntk.cntk_py.use_sparse_gradient_aggregation_in_data_parallel_sgd(False). Note that in the rare case of running distributed training with a CPU device on a GPU build, you need to turn it off manually to avoid an unimplemented exception.
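
To see why skipping the dense conversion matters, consider the following conceptual sketch. It is not CNTK's implementation: the dictionary representation and the helper aggregate_sparse are assumptions made purely for illustration. Each worker reports gradients only for the embedding rows it touched, and the sum is formed over those rows alone.

```python
# Conceptual sketch; not CNTK's implementation.
# Each worker's gradient is a mapping {row_index: gradient_vector} that
# contains only the embedding rows touched by that worker's minibatch.


def aggregate_sparse(worker_grads):
    """Sum per-row gradients across workers without building a dense matrix."""
    total = {}
    for grads in worker_grads:
        for row, vec in grads.items():
            if row in total:
                total[row] = [a + b for a, b in zip(total[row], vec)]
            else:
                total[row] = list(vec)
    return total


# Two workers, a vocabulary of 100,000 rows, but only two rows ever touched.
worker_a = {3: [0.25, 0.5], 70000: [1.0, 0.0]}
worker_b = {3: [0.5, 0.25]}
summed = aggregate_sparse([worker_a, worker_b])
```

The result holds two rows rather than 100,000, so the work scales with the number of touched rows instead of the vocabulary size.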

Reduced rank for convolution in C++ to enable convolution on 1D data

Now convolution and convolution_transpose support data without a channel or depth dimension by setting reductionRank to 0 instead of 1. The motivation for this change is to natively support geometric data without the need to insert a dummy channel dimension through reshaping.

Dilated convolution (GPU only)

We added support for dilated convolution on the GPU, exposed through the BrainScript, C++ and Python APIs. Dilated convolution effectively increases the kernel size without actually requiring a big kernel. To use dilated convolution you need at least cuDNN 6.0. Dilated convolution improved the results of image segmentation in https://arxiv.org/pdf/1511.07122.pdf; in addition, it exponentially increases the receptive field without increasing the required memory. Note that there is currently no implementation of dilated convolution on CPU, so you cannot evaluate a model containing dilated convolution on CPU.
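
The effect on kernel size and receptive field can be checked with a short calculation. The sketch below is plain Python rather than CNTK code; it assumes stride-1 layers and dilations that double from layer to layer, and the helper names are hypothetical.

```python
# Illustrative arithmetic only; not part of the CNTK API.


def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are spaced `dilation` apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions, one per dilation."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


span = effective_kernel_size(3, 2)           # 3 taps spread over 5 positions
dilated = receptive_field(3, [1, 2, 4])      # dilation doubles per layer
undilated = receptive_field(3, [1, 1, 1])    # same depth, no dilation
```

With three 3-wide layers, the dilated stack covers 15 positions and the undilated stack covers 7, while both use the same number of weights per layer.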

Free static axes support for convolution

  • We have added support for free static axes (FreeDimension) for convolution. This allows changing the input tensor size from minibatch to minibatch. For example, in the case of CNNs this allows each minibatch to potentially have a different underlying image size. Similar support has also been enabled for the pooling node.
  • Note that the Faster R-CNN example for object detection does not yet leverage the free static axes support for convolution (i.e., it still scales and pads input images to a fixed size). This example is being updated to use free static axes for arbitrary input image sizes, and is targeted for the next release.

Deterministic Pooling

Calling cntk.debug.force_deterministic() now makes max and average pooling deterministic. This behavior depends on cuDNN version 6 or later.

Add Crop Node to Python API

To support some image segmentation networks, we added the Crop node to the C++ and Python APIs. The Crop node crops its first input along the spatial axes so that the result matches the spatial size of its second (reference) input. All non-spatial dimensions are unchanged. Crop offsets can be specified directly, or computed automatically by traversing the network and matching the centers of receptive fields between activations in the two inputs.
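
The cropping behavior can be pictured with a small sketch on nested lists. This is not the CNTK Crop node: crop_like is a hypothetical helper, and its centered default offset is a simplification standing in for the automatic offset computation described above.

```python
# Conceptual sketch; not the CNTK Crop node.


def crop_like(image, reference, offset=None):
    """Crop 2D `image` to the spatial size of 2D `reference`.

    `offset` is (top, left). When omitted, the crop is centered, which is a
    simplifying assumption in place of receptive-field center matching.
    """
    out_h, out_w = len(reference), len(reference[0])
    in_h, in_w = len(image), len(image[0])
    if offset is None:
        offset = ((in_h - out_h) // 2, (in_w - out_w) // 2)
    top, left = offset
    if top < 0 or left < 0 or top + out_h > in_h or left + out_w > in_w:
        raise ValueError("crop window does not fit inside the input")
    return [row[left:left + out_w] for row in image[top:top + out_h]]


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4, values 0..15
reference = [[0, 0], [0, 0]]                               # 2x2 reference

centered = crop_like(image, reference)                 # automatic offset
corner = crop_like(image, reference, offset=(0, 0))    # explicit offset
```

Both results have the 2x2 spatial size of the reference input; only the position of the crop window differs.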

Performance

Intel MKL update to improve inference speed on CPU by around 2x on AlexNet

This work has been rolled over to the next release due to a dependency on test infrastructure updates.

Keras and Tensorboard

Multi-GPU support for Keras on CNTK.

We added an article that explains how to conduct parallel training on CNTK with Keras. Details are here.

Tensorboard image support for CNTK.

We added support for the TensorBoard image feature. CNTK users can now use TensorBoard to display images. More details and examples can be found here.

Acknowledgments

We thank the following community members for their contributions:

We apologize for any community contributions we might have overlooked in these release notes.

Others

Continued work on the Deep Learning Explained course on edX.