How to encode time properties in recurrent neural networks.

Time-dependent and unequally spaced event encoding for RNNs in Keras.

TL;DR

You have a stream of events that you use for neural network training. Those events have a duration, are spaced at non-equal intervals, or carry some other time-related property. You understand that dropping this information is counter-productive, but you were unable to find a simple, sane, and working method for incorporating it into an existing RNN architecture. Fear no more. Here’s what you need to know:

  1. This paper has all the math, and it’s simple: https://www.groundai.com/project/time-dependent-representation-for-neural-event-sequence-prediction/2
  2. I’ve implemented it as a Keras layer; you can find it on my GitHub

IMHO, this should be of particular interest for event-driven AI and marketing platforms.

Why is this needed?

I’ve been playing around with the Data Science Bowl 2019 Kaggle competition. The underlying dataset is basically a stream of various app-originated events of mixed granularity. At the moment (Dec 2019), the dominating approach seems to be the classical “feature engineering, gradient boosting, ensembling” combo. I was curious whether we can obtain a comparable or better result with RNNs or Transformers, as a more “event-native” (at least it seems so) approach.

Encoding the events themselves isn’t an easy task, given the number of hyperparameters and choices involved. I also stumbled upon an obvious question: how does one encode the temporal information about an event? It seems intuitive that time carries a lot of information, especially when connected to the nature of the event.

I decided to write this article for a number of reasons:

  • it took me more than 10 minutes to find a relevant approach, which is strange for the year 2019
  • there was no GitHub implementation of the mentioned approach
  • it is easy and fun to read and implement, a great dive-in exercise for a beginner ML engineer

How to encode time

[1] proposes two approaches for encoding time. 

  1. Contextualising Event Embedding with Time Mask
  2. Event-time joint Embedding

The authors compare these two approaches and show that the second one gives a better improvement in performance, so I am focusing on it here.

Any time property characterizing event duration or spacing should be represented as a single scalar per event. Depending on the nature of your data, applying a log transform may be a good idea; if you are unsure, treat it as a hyperparameter. Here’s how you should prepare your raw data:

where the event data is a unique event type encoded with a categorical encoder, and the time data is a float representing the time domain of the event (the distance from the previous event, or the time until the next or a key event).
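For illustration, here is a minimal sketch of that preparation (the array names, the timestamp values, and the use of np.log1p are my own assumptions, not prescribed by [1]):

import numpy as np

# Hypothetical event stream: categorically encoded event ids and timestamps (seconds).
event_data = np.array([3, 7, 7, 1, 3], dtype="int32")
timestamps = np.array([0.0, 2.5, 3.0, 60.0, 61.0])

# Time feature: distance from the previous event (the first event gets 0).
deltas = np.diff(timestamps, prepend=timestamps[0])

# Optional log transform for heavy-tailed gaps; treat this choice as a hyperparameter.
time_data = np.log1p(deltas).astype("float32")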

Event-time joint Embedding. Idea and math.

Architecture from the [1] paper. In this article we omit the next-duration regularization.

We want to embed a single float scalar into an embedding space, using the same principle as for word embeddings. Word embedding works pretty simply: it is basically a learnable hash table that maps a word’s index in the vocabulary to a vector of embeddings, without any extra transform.
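To make this concrete, here is a toy sketch of embedding-as-lookup (the table size and the index are made up):

import numpy as np

embedding_table = np.random.rand(10000, 32)  # learnable table: [vocab_size, embedding_dim]
word_index = 42
word_vector = embedding_table[word_index]    # a plain row lookup, no extra transform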

This works for an integer word index, but how can we apply this technique to a float value, which is continuous and non-linear in nature?

The first step is to transform the single scalar into a vector. To do this, we multiply the scalar by a randomly initialized, trainable vector W and add a bias vector B.

The first projection of the scalar time t into the time latent vector space:

v = t · W + B

where W is a weights matrix of shape [1, hidden_vector_dimension] and B is a bias vector of shape [hidden_vector_dimension].

As this is a linear transformation, it doesn’t have much encoding or representational power on its own; you can think of it as a learnable linear scalar-to-vector transformation.
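As a sketch, the projection for a single scalar t looks like this (numpy as a stand-in for the trainable Keras weights):

import numpy as np

hidden_vector_dimension = 16
W = np.random.randn(hidden_vector_dimension)  # trainable weights
B = np.random.randn(hidden_vector_dimension)  # trainable biases
t = 1.7                                       # the scalar time feature
v = t * W + B                                 # scalar projected into a latent vector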

Then, we apply a softmax function to the obtained vector.

As you know, softmax tends to amplify one value of the vector while suppressing the others; think of it as marking “the most important component of the vector”. Then we use the same approach as in word embeddings, but instead of taking the one row that corresponds to a word index, we take the whole embedding matrix and weight all of its rows according to the values of the softmax vector, treating them as weights. Voila, we have a time-embedding vector!
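A quick numeric illustration of this weighted sum (toy numbers: 3 latent units, 2 embedding dimensions):

import numpy as np

w = np.array([0.05, 0.90, 0.05])  # softmax output for one event, sums to 1
E = np.array([[1.0, 0.0],         # embedding matrix: one row per latent unit
              [0.0, 1.0],
              [0.5, 0.5]])
time_embedding = w @ E            # weighted sum of rows -> [0.075, 0.925]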

Now we are free to choose how to combine this vector with the other stream of data: the event embedding obtained from a traditional event-type embedding matrix. My first and natural idea was to concatenate them, but the authors of [1] propose taking the mean of the time embedding and the event embedding, as sketched below.

Combining time and event embedding data
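A minimal Keras sketch of both options (the shapes are illustrative; [1] proposes the mean):

import tensorflow as tf

event_emb = tf.keras.layers.Input(shape=(None, 32))  # event-type embeddings
time_emb = tf.keras.layers.Input(shape=(None, 32))   # time embeddings

combined_concat = tf.keras.layers.Concatenate()([event_emb, time_emb])  # my first idea
combined_mean = tf.keras.layers.Average()([event_emb, time_emb])        # as proposed in [1]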

Event-time joint Embedding. Implementation.

Here I propose a Keras + TensorFlow 2.0 implementation of this time-embedding layer. Feel free to grab it from my GitHub.

Let’s see what is going on here.

We init the layer using two hyperparameters: the size of the hidden embedding latent space and the output size of the time-embedded vector.

In the “build” method we randomly initialize the weights, biases, and embedding matrix.

In the “call” method we do the actual calculations. The projection of the scalar and the application of the softmax function are done using Keras backend functions:

x = tf.keras.activations.softmax(x * self.emb_weights + self.emb_biases)

Then we do the most interesting part: the projection of the latent vector onto the embedding matrix. While this can be done with dot-product notation, it’s not mathematically clean for batched tensors. A better way is to use Einstein notation.

Our input is of shape [batch, timesteps, hidden_vector_dimension] and the output is expected to be of shape [batch, timesteps, embedding_matrix_dimensions]. Mathematically it looks like this:

out[b, s, i] = Σ_v x[b, s, v] · E[v, i]

Einsum notation to get the final time embeddings (E is the embedding matrix emb_final).

Whenever you can express an operation as this kind of mathematical summation, you can use einsum directly, simply by copying the summation indices into the einsum expression.

x = tf.einsum('bsv,vi->bsi',x,self.emb_final)

Here’s the final code.

import tensorflow as tf
from tensorflow.keras.layers import Layer

class TimeEmbedding(Layer):
    def __init__(self, hidden_embedding_size, output_dim, **kwargs):
        super(TimeEmbedding, self).__init__(**kwargs)
        self.output_dim = output_dim
        self.hidden_embedding_size = hidden_embedding_size

    def build(self, input_shape):
        # Trainable projection of the scalar time feature into the latent space.
        self.emb_weights = self.add_weight(name='weights', shape=(self.hidden_embedding_size,),
                                           initializer='uniform', trainable=True)
        self.emb_biases = self.add_weight(name='biases', shape=(self.hidden_embedding_size,),
                                          initializer='uniform', trainable=True)
        # Embedding matrix whose rows are weighted by the softmax output.
        self.emb_final = self.add_weight(name='embedding_matrix',
                                         shape=(self.hidden_embedding_size, self.output_dim),
                                         initializer='uniform', trainable=True)

    def call(self, x):
        # [batch, timesteps] -> [batch, timesteps, 1]
        x = tf.keras.backend.expand_dims(x)
        # Linear projection of the scalar, then softmax over the latent dimension.
        x = tf.keras.activations.softmax(x * self.emb_weights + self.emb_biases)
        # Weighted sum of the embedding matrix rows via Einstein summation.
        x = tf.einsum('bsv,vi->bsi', x, self.emb_final)
        return x

    def get_config(self):
        config = super(TimeEmbedding, self).get_config()
        # Key must match the __init__ argument so the layer can be deserialized.
        config.update({'output_dim': self.output_dim,
                       'hidden_embedding_size': self.hidden_embedding_size})
        return config
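And a quick usage sketch of the layer in a model (the vocabulary size, dimensions, and the LSTM head are my own illustrative choices):

import tensorflow as tf

n_event_types = 100  # hypothetical vocabulary of event types

event_in = tf.keras.layers.Input(shape=(None,), dtype="int32")   # event ids
time_in = tf.keras.layers.Input(shape=(None,), dtype="float32")  # scalar time feature

event_emb = tf.keras.layers.Embedding(n_event_types, 32)(event_in)
time_emb = TimeEmbedding(hidden_embedding_size=16, output_dim=32)(time_in)

x = tf.keras.layers.Average()([event_emb, time_emb])  # mean of the two embeddings, as in [1]
x = tf.keras.layers.LSTM(64)(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[event_in, time_in], outputs=out)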

References

  1. Yang Li, Nan Du, Samy Bengio. Time-Dependent Representation for Neural Event Sequence Prediction. https://arxiv.org/abs/1708.00065

Oleksandr is currently working as a Senior Research Engineer at Ring. Previously he led an AI laboratory at Skylum Software and founded the Let's Enhance.io and Titanovo startups. He does research in image processing, marketing intelligence, and an assorted number of other things with the power of machine learning.
