Demystifying Neural Artistic Style Transfer

Hi! I'm Galuh.

Currently 💼 Data Engineer at Midtrans

Previously 🎓 Fasilkom UI 2014

Wait, isn't this like a developers meetup?



The algorithm behind DeepArt.io and Prisma

What is neural artistic style transfer?

[Content image] + [Style image] = [Generated image]

Image source: arXiv

What we'll learn today (hopefully)

  1. How is the problem formulated?
  2. What computers have to learn to solve this problem
  3. Code! 💻

Step 1: Formulate the Problem

Step 1: Formulate the Problem

Content Image

+

Style Image

=

Generated Image

Step 1: Formulate the Problem

Content Image

What we want

Buildings, trees, river...

What we don't want

Exact colors, textures...

Step 1: Formulate the Problem

Style Image

What we want

Colors (lots of blue? Some hint of yellow?), textures (the famous Van Gogh thick brush strokes)...

What we don't want

Objects in the painting (mountains, houses)

Step 1: Formulate the Problem

Result

What we want

A new picture whose content is closest to our Content Image and whose style is closest to our Style Image

Step 1: Formulate the Problem

How to measure "close"

Euclidean distance!
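As a rough sketch (plain NumPy; the helper name is mine, not from the paper's code), "close" just means a small sum of squared differences between two arrays:

```python
import numpy as np

def squared_euclidean_distance(a, b):
    """Sum of squared differences between two arrays of the same shape."""
    return np.sum((a - b) ** 2)

# Toy example with two tiny 2x2 "images"
img_a = np.array([[0.0, 1.0], [0.5, 0.2]])
img_b = np.array([[0.1, 0.9], [0.4, 0.2]])
print(squared_euclidean_distance(img_a, img_b))  # ≈ 0.03
```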

Step 1: Formulate the Problem

This is an optimization problem!!!

Step 1: Formulate the Problem

Init: some random white noise image

Result
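In code, the starting point can be as simple as a noise image the same size as our content image (a NumPy sketch; the 224x224 size is only illustrative):

```python
import numpy as np

# White-noise starting image, same size as the content image
generated = np.random.uniform(0, 255, size=(224, 224, 3)).astype("float32")
```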

Step 1: Formulate the Problem

Result

... but I have more questions. How can the computer even?!?!?!?

We now know what to do, but how can the computer accomplish all these things?

What does the computer need to do? How does the computer know which one is the content? How does the computer learn which one is the style? I HAZ SO MANY QUESTIONS.

Step 2: What computers have to learn & how

A (very) crash course on neural networks

A (very) crash course on neural networks

Source: Becoming Human

  • For images, conventional (fully connected) neural networks are computationally expensive (a 30x30 grayscale image already needs 900 inputs, and a 224x224 RGB photo needs 224 × 224 × 3 = 150,528!)

A (very) crash course on convolutional neural networks

A (very) crash course on convolutional neural networks

Source: MathWorks

A (very) crash course on convolutional neural networks

VGG

  • One kind of ConvNet architecture
  • Winner of the 2014 ImageNet challenge
  • Used in this paper to extract content and style from the input images (see the loading sketch after this list)
  • Find out more
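If you want to poke at VGG yourself, here's a minimal loading sketch, assuming Keras/TensorFlow (the paper uses the 19-layer variant, VGG19):

```python
from tensorflow.keras.applications import VGG19

# include_top=False drops the fully connected classifier head;
# for style transfer we only need the convolutional feature maps.
vgg = VGG19(weights="imagenet", include_top=False)
vgg.trainable = False  # we never train VGG, we only read its activations
vgg.summary()
```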

How can computers see the content?

  • Higher layers detect higher-level features
  • ... and are therefore good layers to extract our content from! (see the sketch after this list)
  • We'll lose the exact pixel information, but that's OK (in fact, we don't need it)
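For example, here's a sketch (assuming the `vgg` model loaded above) that reads the feature maps of one higher layer; 'block4_conv2' is the Keras name for the conv4_2 layer Gatys et al. use for content:

```python
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.applications.vgg19 import preprocess_input

# A model that maps an input image to one higher-layer feature map
content_extractor = Model(inputs=vgg.input,
                          outputs=vgg.get_layer("block4_conv2").output)

image = np.random.uniform(0, 255, size=(1, 224, 224, 3))  # stand-in for a real photo
content_features = content_extractor(preprocess_input(image))
print(content_features.shape)  # (1, 28, 28, 512): no exact pixels left, just features
```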

How close is our generated image to our content image?


We calculate the Euclidean distance between the corresponding feature map of our content image and feature map of our generated image.
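As a sketch (TensorFlow; the helper name is mine), the content loss is just that squared distance over one layer's feature maps; the paper also multiplies by 1/2:

```python
import tensorflow as tf

def content_loss(content_features, generated_features):
    # 1/2 * sum of squared differences between the two feature maps
    return 0.5 * tf.reduce_sum(tf.square(generated_features - content_features))
```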

How close is our generated image to our style image?


It's not as straightforward, but it's OK!

The Euclidean distance still comes in handy, but instead of calculating the distance between the feature maps themselves, we'll be calculating the distance between the Gram matrices of those feature maps.

Gram matrices-a-what?

Refresher: Gram matrix = a matrix multiplied by its transpose

We'll be looking at the correlations between feature responses in an image. In each layer, we take the inner product between every pair of (flattened) feature maps.

Another way to think of it: the spatial layout of our image gets thrown away, because every entry of the Gram matrix sums over all spatial positions; only which features tend to appear together survives.
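Here's a sketch of both pieces (TensorFlow; helper names are mine): the Gram matrix of one layer's feature maps, and the paper's per-layer style loss built on top of it. It assumes the feature maps have a known static shape:

```python
import tensorflow as tf

def gram_matrix(feature_maps):
    # feature_maps: (height, width, channels) activations from one VGG layer
    h, w, c = feature_maps.shape
    # Flatten each channel into one long vector of h*w spatial positions...
    flat = tf.reshape(feature_maps, (h * w, c))
    # ...then take the inner product between every pair of channels.
    # Every entry sums over all spatial positions, so the layout is gone.
    return tf.matmul(flat, flat, transpose_a=True)  # shape (c, c)

def style_loss(style_gram, generated_gram, h, w, c):
    # Per-layer style loss from the paper: 1 / (4 * N^2 * M^2) * sum of squared
    # differences, with N = number of channels and M = number of spatial positions
    n, m = c, h * w
    return tf.reduce_sum(tf.square(generated_gram - style_gram)) / (4.0 * (n ** 2) * (m ** 2))
```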

Gram matrices-a-what?

Wrapping everything up


We combine the content loss and the style loss into a single total loss, weighted by two parameters we can adjust depending on our preference (more style? More content?)
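A sketch of the combined objective, assuming the content_loss and style_loss helpers above; the weight values here are only illustrative (the paper reports content/style ratios in the ballpark of 10^-3 to 10^-4):

```python
alpha = 1e-3  # content weight (illustrative)
beta = 1.0    # style weight (illustrative); raise beta for more style, alpha for more content

def total_loss(content_features, generated_features, style_gram, generated_gram, h, w, c):
    return (alpha * content_loss(content_features, generated_features)
            + beta * style_loss(style_gram, generated_gram, h, w, c))
```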

Now what?


We need to optimize iteratively to minimize this loss. The gradients give our initial random image a direction to move in, step by step, towards a generated image with minimum loss.

The paper uses an optimization algorithm called Limited-memory BFGS (L-BFGS).
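A sketch of that loop with SciPy's L-BFGS routine; `compute_loss_and_grads` is a hypothetical helper that would evaluate the total loss and its gradient with respect to the generated pixels (e.g. with tf.GradientTape):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def loss_and_grads(flat_pixels):
    image = flat_pixels.reshape(1, 224, 224, 3).astype("float32")
    loss, grads = compute_loss_and_grads(image)  # hypothetical helper, not from the paper's code
    return float(loss), np.asarray(grads, dtype="float64").flatten()

# Start from the white-noise image and let L-BFGS push it towards lower loss
x = np.random.uniform(0, 255, size=(1, 224, 224, 3)).flatten()
for i in range(10):
    x, loss_value, _ = fmin_l_bfgs_b(loss_and_grads, x, maxfun=20)
    print(f"Iteration {i}: loss = {loss_value:.2f}")
```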

Step 3: Code 💻

Jupyter Notebook

This is the annotated version of Gatys et al.'s code. My own implementation was too messy to be presented and I didn't have time to tidy everything up, unfortunately.

Is that all?

No! Check these out:

Real-time style transfer!

Style transfer for videos!

That's it. 🎉 Thanks!