May 18, 2019

May Papers

Lately I’ve been interested in how machine learning can help artists create animations for characters and critters. So, you’ll find that many of the papers I read in May are connected to this theme. I try to keep the descriptions of the papers brief. Those that particularly inspire me might end up as posts of their own in the future.

Phase-Functioned Neural Networks for Character Control

This is the first paper I read in the ML-generated animation space. It sets up the animation network as a neat conditional neural network. First, they have a phase function that takes in a phase value (think of this as indicating what point in the animation cycle the character is currently in) and produces the weights of a second neural network, which converts the previous frame's data plus user input into the motion for the current frame plus a change in the phase value. The phase function is modeled as a Catmull-Rom spline that interpolates between 4 sets of expert weights. The cool thing about the Catmull-Rom spline is that it can form a loop, and the authors leverage this so that the phase function is cyclic. This is a natural choice for the domain, since a character's walking animation is itself a loop.
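
Here's a minimal sketch (my own, not the authors' code) of how a cyclic Catmull-Rom spline could blend four sets of expert weights given a phase value in [0, 1); the array shapes and function names are hypothetical.

```python
import numpy as np

def cyclic_catmull_rom(control_points, phase):
    """Evaluate a cyclic Catmull-Rom spline through the control points at
    phase in [0, 1). The control points wrap around, so the curve loops."""
    n = len(control_points)              # e.g. 4 sets of expert weights
    t = phase * n                        # position along the loop
    i = int(np.floor(t))                 # segment index
    u = t - i                            # local parameter within the segment

    # Four neighbouring control points, wrapping around the loop.
    p0 = control_points[(i - 1) % n]
    p1 = control_points[i % n]
    p2 = control_points[(i + 1) % n]
    p3 = control_points[(i + 2) % n]

    # Standard Catmull-Rom basis polynomial.
    return 0.5 * (2 * p1
                  + (-p0 + p2) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * u**3)

# Four hypothetical expert weight matrices for one layer of the motion network.
experts = [np.random.randn(512, 256) for _ in range(4)]
weights_at_phase = cyclic_catmull_rom(experts, phase=0.37)
```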

Mode-Adaptive Neural Networks for Quadruped Motion Control

This paper is something of an extension of the phase-functioned paper above. Instead of using a spline to interpolate between experts, the authors use a gating neural network that, given the motion from the previous frame, predicts mixing values $\alpha_1, \alpha_2, \dots, \alpha_n$ for the expert weights. These mixing values are then used to compute the parameters of the motion network, $\theta = \sum_{i=1}^{n} \alpha_i \theta_i$, where $\theta_i$ are the $i$-th expert's parameters.
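
A rough sketch of the blending step (a simplification, not the paper's architecture; the gating network here is a single hypothetical linear layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def blend_expert_weights(prev_motion, gating_params, expert_params):
    """Blend n sets of expert parameters into one set of network weights.

    prev_motion:   feature vector describing the previous frame's motion
    gating_params: (W_g, b_g) of a hypothetical one-layer gating network
    expert_params: list of n parameter arrays, one per expert
    """
    W_g, b_g = gating_params
    alphas = softmax(W_g @ prev_motion + b_g)   # mixing values alpha_1..alpha_n
    # theta = sum_i alpha_i * theta_i
    return sum(a * theta for a, theta in zip(alphas, expert_params))
```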

The authors analyze the behavior of their network by inspecting the values of the expert mixing weights over time, which show some interesting periodicity. They also ablate individual experts and observe the resulting effect on the character's animation. They found that certain experts were tailored towards moving left vs. right, and that some experts focused on the high-frequency components of the motion that give it its natural feel. I love this kind of analysis, and I think the authors did an amazing job.

I feel like this technique could be extended pretty naturally to arbitrary time-series data, especially time-series data with a periodic or seasonal component.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

After reading the mode-adaptive paper, I wondered how common it is to mix the weights of the experts rather than their outputs. This led me to this paper, where the authors train very large mixture-of-experts neural networks with gating functions. In this case the gating functions gate the outputs of the experts, rather than their weights. There is also a big focus on enforcing sparsity: they truncate the gating values to the $k$ largest weights.
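
A bare-bones sketch of top-$k$ output gating (my own paraphrase; the paper also adds noise to the gate and load-balancing losses that I'm skipping here):

```python
import numpy as np

def sparse_moe(x, gate_weights, experts, k=2):
    """Sparsely-gated mixture of experts: only the top-k experts are
    evaluated, and their *outputs* (not their weights) are mixed."""
    logits = x @ gate_weights                 # one gating logit per expert
    top_k = np.argsort(logits)[-k:]           # indices of the k largest gates
    # Softmax over the surviving logits; the rest are effectively zeroed out.
    g = np.exp(logits[top_k] - logits[top_k].max())
    g /= g.sum()
    # Weighted sum of the selected experts' outputs.
    return sum(gi * experts[i](x) for gi, i in zip(g, top_k))
```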

This is quite different from the Mode-Adaptive approach, which gates the weights rather than the outputs and isn't concerned with sparsity. I wonder if the mode-adaptive network authors tried this?

Learning feed-forward one-shot learners

This paper was referenced in the discussion section of the Mode-Adaptive NN paper as a technique one could use to generate walking animations for quadrupeds for which you don't have motion capture data. To do so, you assume you already have a diverse database of motion capture data for various quadrupeds. You then train a one-shot learner on this data: a model that learns to take an exemplar of an unseen type of quadruped and produce the weights of a NN that generates its walking animation.

$$W = f(z; W'), \qquad y = g(x; W)$$

where $W'$ are the parameters of the meta-network, $W$ are the parameters used by the walking-animation network, $z$ is an exemplar, and $x$ is the frame data used to produce the animation.

The challenge here, however, is that learning a NN that produces the weights of another NN is intractable when done naively, since the output space $W$ of the meta-network is massive. Instead of building $W$ directly from the exemplar, the authors modify the model so that the exemplar is only used to generate the singular values of a factorized matrix:

$$W = M_1 \, \mathrm{diag}(f(z; W')) \, M_2, \qquad y = g(x; W)$$

where $M_1$ and $M_2$ are learned matrices independent of the exemplar, and $f$ is a NN that takes the exemplar as input and produces the diagonal values of the matrix.
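
To make the factorization concrete, here's a minimal sketch (not the paper's code; the shapes and the single-layer $f$ and $g$ are stand-ins):

```python
import numpy as np

def one_shot_weights(z, M1, M2, f_params):
    """Generate the animation network's weights W from an exemplar z.

    M1, M2 are learned matrices that don't depend on the exemplar; f is the
    meta-network (here a hypothetical single linear layer) that maps the
    exemplar to the diagonal (singular) values.
    """
    W_f, b_f = f_params
    s = W_f @ z + b_f                     # f(z; W'): predicted diagonal values
    return M1 @ np.diag(s) @ M2           # W = M1 diag(f(z; W')) M2

def g(x, W):
    """y = g(x; W): a single linear layer standing in for the full network."""
    return W @ x
```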

I thought this paper was neat, but it requires already having a dataset that is pretty expensive to curate. I also wonder if there are better ways to make the meta-learning problem tractable.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This one is a bit different from the overall theme of the papers I read this month, but it was hard to pass it up, since it has been making such waves in the ML community. The authors set up a clever training procedure that lets them use bidirectional transformers rather than limiting the model to be auto-regressive. The end result is a model that can take in short sequences (~512 tokens) and produce context-aware embeddings for the tokens in the sequence. It can also produce whole-sequence embeddings via the clever use of a [CLS] token.

Overall, their results are a very impressive demonstration of what is possible in the unsupervised learning space when you can train massive models on an absurd amount of data. I suppose technically this model isn’t unsupervised since they train against supervision signals like predicting sequential sentences and predicting masked words. But these signals are pretty weak, and the resulting learned representations are transferable to other tasks.
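
As a rough sketch of how the masked-word training signal is set up (simplified; not the exact BERT recipe):

```python
import random

def make_bert_example(tokens_a, tokens_b, mask_prob=0.15):
    """Build one simplified BERT-style training example:
    [CLS] sentence A [SEP] sentence B [SEP], with ~15% of the word tokens
    replaced by [MASK]. The model must predict the original tokens, and the
    [CLS] position is used to predict whether B actually follows A.
    (The real recipe also sometimes keeps or randomly replaces masked words.)"""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            labels[i] = tok              # target the model must recover
            tokens[i] = "[MASK]"
    return tokens, labels

tokens, labels = make_bert_example(["the", "dog", "barked"], ["it", "was", "loud"])
```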

Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

I ran into this paper after reading the StyleGAN paper last month. StyleGAN uses AdaIN (adaptive instance normalization) layers in its architecture, so I figured it'd be a good idea to learn where AdaIN came from.

The inspiration for AdaIN seems to come from the instance normalization layer, which re-scales the input features such that their spatial mean and standard deviation match two learnable parameters $\gamma$ and $\beta$:

$$\mathrm{IN}(x) = \gamma\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \beta$$

Empirically, the style transfer literature has established that the "style" of an image is partially captured by the spatial variation of convolutional features, and that this variation can be summarized with low-order moments like the mean and standard deviation. So, by re-normalizing the features to match a specified mean and standard deviation, you are re-styling the image.

Adaptive instance normalization takes this idea one step further by removing the learnable parameters and having instance normalization match the moments coming from the features of the style image itself:

$$\mathrm{AdaIN}(x, s) = \sigma(s)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(s)$$
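
In NumPy, the operation is just a few lines (a sketch of my own, assuming conv features shaped channels × height × width):

```python
import numpy as np

def adain(content_feats, style_feats, eps=1e-5):
    """Adaptive instance normalization for conv features shaped (C, H, W).

    Each channel of the content features is normalized over its spatial
    dimensions, then re-scaled and re-shifted using the per-channel std and
    mean of the style features."""
    mu_c = content_feats.mean(axis=(1, 2), keepdims=True)
    std_c = content_feats.std(axis=(1, 2), keepdims=True)
    mu_s = style_feats.mean(axis=(1, 2), keepdims=True)
    std_s = style_feats.std(axis=(1, 2), keepdims=True)
    return std_s * (content_feats - mu_c) / (std_c + eps) + mu_s
```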

There is quite a bit more to this paper, especially in the definition of the loss function and the overall model architecture. But, I’ll stop here, otherwise I’d risk writing a whole post on this.



