# May Papers

Lately I’ve been interested in how machine learning can help artists create animations for characters and critters, so many of the papers I read in May are connected to this theme. I try to keep the descriptions of the papers brief. Those that particularly inspire me might end up as posts of their own in the future.

# Phase-Functioned Neural Networks for Character Control

This is the first paper I read in the ML-generated animation space. It
sets up the animation network as a neat conditional neural
network. First, there is a function that takes in a phase value
(think of this as a value indicating what point in the animation cycle
the character is currently at) and produces the *weights*
for a second neural network. That second network converts the data from
the last frame plus the user input into the motion for the current frame
plus a change in phase value. The phase function is modeled using a
Catmull-Rom spline
that interpolates between 4 sets of expert weights. The cool thing
about the Catmull-Rom spline is that it can form a closed loop, and the
authors leverage this capability so that the phase function is
cyclic. This is a natural choice for this domain, since walking
animations for characters are looped.
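
To make the cyclic interpolation concrete, here is a minimal NumPy sketch of a phase function. It is my own illustration, not the authors' code: the function name, the array layout, and the convention that the phase lives in \([0, 2\pi)\) are assumptions, and the coefficients are the standard uniform Catmull-Rom ones.

```python
import numpy as np

def phase_function(phase, control_points):
    """Cyclic Catmull-Rom interpolation between sets of expert weights.

    phase:          scalar, assumed to live in [0, 2*pi)
    control_points: array of shape (n, ...) holding n sets of weights
                    (n = 4 in the setup described above)
    """
    n = control_points.shape[0]
    # Map the phase onto [0, n) so each unit interval is one spline segment.
    w = (phase % (2 * np.pi)) / (2 * np.pi) * n
    k1 = int(np.floor(w)) % n
    t = w - np.floor(w)
    # Neighbouring control points; the modulo wraps them into a closed loop.
    k0, k2, k3 = (k1 - 1) % n, (k1 + 1) % n, (k1 + 2) % n
    a0, a1, a2, a3 = (control_points[k] for k in (k0, k1, k2, k3))
    return (a1
            + t * (0.5 * a2 - 0.5 * a0)
            + t ** 2 * (a0 - 2.5 * a1 + 2.0 * a2 - 0.5 * a3)
            + t ** 3 * (1.5 * a1 - 1.5 * a2 + 0.5 * a3 - 0.5 * a0))

# Hypothetical example: 4 expert weight matrices for one 3x5 layer.
experts = np.random.default_rng(0).normal(size=(4, 3, 5))
layer_weights = phase_function(1.2, experts)  # weights used for this frame
```

Because the neighbour indices wrap around, the weights at a phase of \(2\pi\) match the weights at 0, which is exactly the looping behaviour you want for a walk cycle.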

# Mode-Adaptive Neural Networks for Quadruped Motion Control

This paper is kind of an extension of the phase-functioned paper above. Instead of using a spline to interpolate between experts, the authors use a neural network that predicts mixing values \(\alpha_1, \alpha_2, \ldots, \alpha_n\) for the expert weights given the motion from the previous frame. These mixing values are then used to compute the parameters of the motion network, \(\theta = \sum_{i=1}^{n} \alpha_i \theta_i\), where \(\theta_i\) are the \(i\)-th expert’s parameters.
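
Here is a rough NumPy sketch of that blending step. Everything about the architecture is a simplifying assumption on my part: the gating network is a single linear layer followed by a softmax, the motion network is a two-layer net with a `tanh` hidden layer, and all names are hypothetical. Only the blend \(\theta = \sum_i \alpha_i \theta_i\) comes from the description above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def blend_experts(alphas, expert_params):
    """theta = sum_i alpha_i * theta_i, for params stacked along axis 0."""
    return np.tensordot(alphas, expert_params, axes=1)

def motion_network(x, params):
    """Stand-in motion network: one tanh hidden layer (an assumption)."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ x + b1) + b2

def mann_step(prev_motion, gating_W, experts):
    """experts: tuple (W1, b1, W2, b2), each with a leading expert axis."""
    # Gating network predicts the mixing values from the previous frame.
    alphas = softmax(gating_W @ prev_motion)
    # Blend the *parameters* of the experts, then run a single network.
    blended = [blend_experts(alphas, p) for p in experts]
    return motion_network(prev_motion, blended), alphas

# Hypothetical sizes: 3 experts, 5-d motion features, 8 hidden units.
rng = np.random.default_rng(0)
experts = (rng.normal(size=(3, 8, 5)), rng.normal(size=(3, 8)),
           rng.normal(size=(3, 5, 8)), rng.normal(size=(3, 5)))
next_motion, alphas = mann_step(rng.normal(size=5), rng.normal(size=(3, 5)), experts)
```

Note that for a single linear layer, blending weights and blending outputs give the same result. The two only differ once there are nonlinearities between layers, which is why the sketch uses a hidden layer.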

The authors analyze the behavior of their network by inspecting the values of the expert mixing weights over time. There is some interesting periodicity in the behavior of the weights. They also ablate certain experts and observe the resulting effect on the animation of the character. They found that certain experts were tailored towards moving left vs. right, and that some experts focused on high-frequency components of the motion that contribute to the motion’s natural feel. I love this kind of analysis, and I think the authors did an amazing job.

I feel like this technique could be extended pretty naturally to arbitrary time-series data, especially data that has a periodic or seasonal component.

# Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

After reading the Mode-Adaptive paper, I was wondering how common it
was to mix the weights of the experts rather than the outputs of the
experts. This led me to this paper, where the authors
train very large mixture-of-experts neural networks with gating
functions. In this case the gating functions gate the *output* of the
experts, rather than their weights. There is also a big focus on
enforcing sparsity: they keep only the \(k\) largest gating values
and zero out the rest.

This is quite different from the Mode-Adaptive approach, which neither enforces sparsity nor gates the outputs. I wonder if the Mode-Adaptive network authors tried this?
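
For contrast with the weight blending above, here is a small NumPy sketch of output gating with top-\(k\) sparsity. It is a simplification I wrote for illustration: the gate is a plain linear layer, I leave out any noise or load-balancing terms, and the function names are hypothetical.

```python
import numpy as np

def top_k_gates(logits, k):
    """Keep the k largest gate logits, softmax over them, zero the rest."""
    top = np.argsort(logits)[-k:]
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]
    e = np.exp(masked - masked.max())  # exp(-inf) == 0 for dropped experts
    return e / e.sum()

def moe_output(x, gate_W, experts, k):
    """Gate the *outputs* of the experts: y = sum_i gate_i * expert_i(x)."""
    gates = top_k_gates(gate_W @ x, k)
    # Only the experts with a non-zero gate are evaluated at all.
    y = sum(gates[i] * experts[i](x) for i in np.flatnonzero(gates))
    return y, gates

# Hypothetical example: 4 tiny linear experts, only 2 active per input.
rng = np.random.default_rng(0)
expert_mats = rng.normal(size=(4, 3, 5))
experts = [lambda x, M=M: M @ x for M in expert_mats]
y, gates = moe_output(rng.normal(size=5), rng.normal(size=(4, 5)), experts, k=2)
```

The sparsity is what makes very large models affordable here: experts with a zero gate are never evaluated, whereas blending weights requires touching every expert's parameters on every frame.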

# Learning feed-forward one-shot learners

This paper was referenced in the Mode-adaptive NN paper in the
discussion section as a possible technique one could use to generate
walking animations for quadrupeds for which you don’t have motion
capture data. To do so, you assume you already have some diverse
database of motion capture data for various quadrupeds. You then learn
a *one-shot learner* on this training data: a model that learns to
take an exemplar for an unseen type of quadruped and produce the
weights of a NN that would produce its walking animation.

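Written naively, the meta network \(f\) maps the exemplar directly to the weights of the animation network \(g\):

\[ \begin{gather*} W = f\left(z; \, W' \right) \\ y = g\left(x; \, W \right) \end{gather*} \]

where \(W'\) are the parameters of the meta network, \(W\) are the parameters used by the walking animation network, \(z\) is an exemplar, and \(x\) is the frame data used to produce the animation.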

The challenge here, however, is that learning a NN that produces weights for another NN is intractable when done naively, since the output space \(W\) of the meta network is massive. Instead of directly predicting \(W\) from the exemplar, the authors modify the model so that the exemplar is only used to generate the singular values of a factored matrix:

\[ \begin{gather*} W = M_1 \, \operatorname{diag}\left(f\left(z; \, W' \right)\right) \, M_2 \\ y = g\left(x; \, W \right) \end{gather*} \]

Here \(M_1\) and \(M_2\) are learned matrices independent of the exemplar, and \(f\) is a NN that takes the exemplar as input and produces the diagonal entries of the middle matrix.
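
A minimal NumPy sketch of the factored version, mostly to show how much smaller the meta network's output becomes. The one-layer `tanh` learner, the linear stand-in for \(g\), and all of the shapes are assumptions I made for illustration.

```python
import numpy as np

def predict_diagonal(z, W_prime):
    """f(z; W'): hypothetical one-layer learner producing r diagonal values."""
    return np.tanh(W_prime @ z)

def factored_weights(z, W_prime, M1, M2):
    """W = M1 diag(f(z; W')) M2, without materialising the diagonal matrix."""
    d = predict_diagonal(z, W_prime)
    return (M1 * d) @ M2  # scaling the columns of M1 equals M1 @ diag(d)

def g(x, W):
    """Stand-in for the animation network: a single linear layer."""
    return W @ x

# Hypothetical sizes: 16-d exemplar, 64x32 target layer, rank 8.
rng = np.random.default_rng(0)
z = rng.normal(size=16)
W_prime = rng.normal(size=(8, 16))
M1, M2 = rng.normal(size=(64, 8)), rng.normal(size=(8, 32))
W = factored_weights(z, W_prime, M1, M2)
y = g(rng.normal(size=32), W)
```

With these made-up sizes, the learner only has to output 8 numbers per exemplar instead of the 64 × 32 = 2048 entries of \(W\).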

I thought this paper was neat, but it requires already having a dataset that is pretty expensive to curate. I also wonder if there are better ways to make the meta-learning problem tractable.

# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This one is a bit different from the overall theme of the papers I read this month. But it was hard to pass up this paper, since it has been making such waves in the ML community. The authors set up a clever training procedure that lets them use bi-directional transformers rather than limiting the model to be auto-regressive. The end result is a model that can take in short sequences (up to 512 tokens) and produce context-aware embeddings for the tokens in the sequence. It can also produce whole-sequence embeddings via the clever use of a [CLS] token.

Overall, their results are a very impressive demonstration of what is possible in the unsupervised learning space when you can train massive models on an absurd amount of data. I suppose technically this model isn’t unsupervised, since they train against supervision signals like predicting whether one sentence follows another and predicting masked words. But these signals come for free from raw text, and the resulting learned representations are transferable to other tasks.

# Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

I ran into this paper after reading the StyleGAN paper last month. StyleGAN uses AdaIN (adaptive instance normalization) layers in its architecture, so I figured it’d be a good idea to learn about where AdaIN came from.

The inspiration for AdaIN seems to come from the “instance normalization” layer, which rescales the input features such that their spatial mean and standard deviation match two learnable parameters: \(\beta\) for the mean and \(\gamma\) for the standard deviation.

\[ \operatorname{IN}\left(x\right) = \gamma \left(\frac{x - \mu(x)}{\sigma(x)} \right) + \beta \]

Empirically, the style transfer literature has established that the “style” of an image is partially captured by the spatial variation of convolutional features, and that this variation can be summarized with low-order moments like the mean and standard deviation. So, by re-normalizing the features to match a specified mean and standard deviation, you are re-styling the image.

Adaptive instance normalization takes this idea one step further by
removing the learnable parameters and having instance normalization
match the moments coming from the features of the *style* image
itself.
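
In code, the operation is only a few lines. This is a NumPy sketch under my own assumptions: feature maps are laid out as `(channels, height, width)`, statistics are taken per channel over the spatial dimensions, and the small `eps` is only there to avoid dividing by zero.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization on (C, H, W) feature maps.

    Normalizes each content channel to zero mean and unit std, then
    rescales it to the mean and std of the matching style channel.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical feature maps; the spatial sizes do not need to match.
rng = np.random.default_rng(0)
content_feats = rng.normal(size=(3, 8, 8))
style_feats = 2.5 * rng.normal(size=(3, 5, 7)) + 4.0
stylized = adain(content_feats, style_feats)
```

Compared to plain instance normalization, \(\gamma\) and \(\beta\) have simply been replaced by the style features' standard deviation and mean, so there is nothing left to learn in this layer.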

There is quite a bit more to this paper, especially in the definition of the loss function and the overall model architecture. But, I’ll stop here, otherwise I’d risk writing a whole post on this.