Deezer R&D goes to NIPS 2016

This month, Jimena and Romain from the Deezer R&D team were in Barcelona to attend NIPS — Neural Information Processing Systems, one of the main conferences on artificial intelligence.

Many scientists and companies working on machine learning were there: Google, Facebook, DeepMind, Microsoft Research, Amazon and Criteo, as well as universities from all over the world.

This year set a new attendance record, with more than 6,000 participants, confirming the importance of this field in both industry and academia.

Many aspects of artificial intelligence were addressed, but this year the hottest topics were the following.

NIPS 2016 poster and NIPS logo.

Reinforcement Learning

Reinforcement Learning (RL) systems can learn to solve a complex problem without needing to be explicitly taught how to do it.

Recently, notable problems have been tackled using RL architectures: ATARI games played automatically from raw game pixels, humans defeated at the game of Go, simulated animals that learn to walk and run, and robot arms that learn to manipulate objects.

In an RL framework, there is an agent (for example, a player) living in an environment (a game) that must make decisions (e.g., which direction to take) in order to maximize a reward (winning points).

To solve the task, the RL system learns a correspondence (a mapping) from the agent's state and the environment's observations to the action to take.
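To make this framework concrete, here is a minimal sketch of the agent–environment loop with a tabular Q-learning update. The `ToyEnv` environment and all hyper-parameters are made up purely for illustration; they are not taken from any system shown at the conference.

```python
import random
from collections import defaultdict

class ToyEnv:
    """Purely illustrative environment: reach state 4 starting from state 0."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnv()
q = defaultdict(float)                 # Q-values for (state, action) pairs
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy: explore sometimes, otherwise pick the best known action
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = max([0, 1], key=lambda a: q[(state, a)])
        next_state, reward, done = env.step(action)
        # Q-learning update: move the estimate toward reward + discounted future value
        best_next = max(q[(next_state, a)] for a in [0, 1])
        target = reward + (0.0 if done else gamma * best_next)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state
```

The agent repeatedly observes a state, picks an action, receives a reward and nudges its value estimates toward the observed outcome: exactly the mapping from states and observations to actions described above.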

We saw the biggest machine learning companies release artificial intelligence platforms where it is possible to train RL systems: Universe from OpenAI, DeepMind Lab from Google's DeepMind, and Project Malmo (built on Minecraft) from Microsoft.

Visualisation of an agent trained with deep reinforcement learning methods at DeepMind (source).

Generative networks

Generative networks are systems that are able to generate data, such as images that look like real images.

Recently, a new way of generating images, based on two neural networks, was proposed: one network generates images, while the other tries to discriminate actual images from images generated by the first one.

The two parts thus act against each other, the discriminator trying to detect fake images and the generator trying to fool the discriminator: that is why they are called adversarial.
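As a rough illustration of this adversarial game, here is a minimal training-loop sketch written with PyTorch; the network sizes, random placeholder data and hyper-parameters are arbitrary choices of ours, not those of any particular paper presented at NIPS.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64          # arbitrary sizes for illustration

# Generator: maps random noise to a fake sample
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
# Discriminator: outputs the probability that a sample is real
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(32, data_dim)   # placeholder for a batch of real data
    noise = torch.randn(32, latent_dim)
    fake = G(noise)

    # 1) Train the discriminator to tell real samples from generated ones
    opt_D.zero_grad()
    loss_D = loss_fn(D(real), torch.ones(32, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(32, 1))
    loss_D.backward()
    opt_D.step()

    # 2) Train the generator to fool the discriminator
    opt_G.zero_grad()
    loss_G = loss_fn(D(fake), torch.ones(32, 1))
    loss_G.backward()
    opt_G.step()
```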

This kind of architecture was proposed two years ago, but lacked stability and the ability to generate realistic images: as can be seen below, the generated images look quite real from far away, but look very weird when examined up close.

Many papers thus addressed these issues.

Images generated by a Generative adversarial network (source).

Besides these main topics, other subjects drew our attention:

  • The workshop on extreme classification (that is, classifying items with an extremely large number of labels) was quite interesting, showing in particular how multi-modal approaches (for example, using both text and images) can result in more accurate classifications than mono-modal ones.
  • A fun and interesting poster was presented by researchers from Boston University and Microsoft Research, who found that a word embedding space (a popular framework for representing text data as vectors), learned from Google News, contained a direction that encoded gender bias. For example, the embedding revealed implicit sexism in the text, with a geometric representation of the correspondence man::computer programmer and woman::homemaker. The authors also found a way to remove this bias from the embedding space (a rough sketch of this idea appears after this list).
  • An interesting demo was given by the YouTube team. They showed how to learn a content-based video similarity on the YouTube-8M dataset. The system was trained to predict ground-truth video relationships (identified by a co-watch-based system) from visual content only.
  • In another interesting work using videos (from the Flickr website), researchers from MIT trained a system to solve an acoustic scene/object classification task using a large dataset of unlabeled videos. They managed to transfer discriminative visual knowledge from image classification networks into the sound domain, learning acoustic representations of natural scenes. Using raw audio as input, their deep network was able to predict the objects appearing in videos from the audio alone.
Some examples of images from videos with sound scene labels used in “SoundNet: Learning Sound Representations from Unlabeled Video” (source).
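As promised above, here is a rough sketch of the bias-direction idea, not the authors' exact method: estimate a gender direction from a few gendered word pairs, measure how strongly other words project onto it, and neutralise them by removing that component. The `vectors` dictionary below is a random placeholder standing in for a real embedding such as the word2vec vectors trained on Google News.

```python
import numpy as np

# Placeholder embedding: in practice these would come from word2vec / Google News vectors
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in
           ["he", "she", "man", "woman", "programmer", "homemaker"]}

# Estimate a "gender direction" from differences of gendered word pairs
pairs = [("he", "she"), ("man", "woman")]
diffs = np.stack([vectors[a] - vectors[b] for a, b in pairs])
gender_dir = diffs.mean(axis=0)
gender_dir /= np.linalg.norm(gender_dir)

def bias_along_direction(word):
    """Signed projection of a word vector onto the gender direction."""
    v = vectors[word]
    return float(np.dot(v, gender_dir) / np.linalg.norm(v))

def neutralize(word):
    """Remove the component of a word vector lying along the gender direction."""
    v = vectors[word]
    return v - np.dot(v, gender_dir) * gender_dir

print(bias_along_direction("programmer"))                    # before debiasing
print(float(np.dot(neutralize("programmer"), gender_dir)))   # ~0 after debiasing
```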

So we learned a lot at NIPS this year, and had the chance to exchange ideas with researchers from other tech companies!

Jimena Royo-Letelier & Romain Hennequin
Research Scientists at Deezer