
Recommender Systems – Applying Graph and NLP techniques

(By Eugene Yan, republished with permission)

In the previous post, we established that a baseline recommender system (“recsys”) based on matrix factorization is able to achieve an AUC-ROC of ~0.8 (random guessing has AUC-ROC = 0.5).

Can we do better? Also, for curiosity’s sake, are there other approaches to recsys?

As it turns out, yes, and yes.

Natural Language Processing (NLP) and Graphs

In 2013, a breakthrough was made in NLP with the word2vec papers by Tomas Mikolov here and here. They demonstrated that w2v was able to learn semantic and syntactic vector representations of words in an unsupervised fashion.

Simply put, w2v can convert words (or objects) in sequences into a numeric representation, without labels—not needing labels is huge as getting labelled data is a key bottleneck in machine learning.

Another interesting development is DeepWalk which learns representations of online social networks graphs. By performing random walks to generate sequences, the paper demonstrated that it was able to learn vector representations of nodes (e.g., profiles, content) in the graph.

Figure 1a. Arbitrary image of a (social) graph

How do these matter in recsys?

It’s a slight stretch, but here’s the gist of it:

  • Use the product-pairs and associated relationships to create a graph
  • Generate sequences from the graph (via random walk)
  • Learn product embeddings based on the sequences (via word2vec)
  • Recommend products based on embedding similarity (e.g., cosine similarity, dot product)

Ready? Let’s get started.

Creating a graph

We have product-pairs, and each of them has an associated score. We can think of these scores as (graph) edge weights.

With these weights, we can create a weighted graph (i.e., a graph where each edge is given a numerical weight, instead of all edges having the same weight). This can be easily done with networkx.
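As a minimal sketch (the product IDs here are made up), building the weighted graph with networkx looks like this:

```python
import networkx as nx

# Toy product-pairs with scores as edge weights (IDs are illustrative)
weighted_pairs = [
    ("B001", "B002", 1.2),
    ("B001", "B003", 0.5),
    ("B002", "B003", 1.0),
]

G = nx.Graph()
G.add_weighted_edges_from(weighted_pairs)  # stored as the "weight" edge attribute
```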

Voilà! We have our product network graph.

Generating sequences

With our product graph, we can generate sequences via random walk.

The direct approach is to traverse the networkx graph. For example, if we want 10 sequences of length 10 for each starting node, we’ll need to traverse the graph 100 times per starting node (also called vertex).

The electronics graph has 420k nodes and the books graph has 2 million nodes. Multiply that by 100 and that’s a lot of graph queries—this is very slow and takes a lot of memory (trust me).

Thankfully, we don’t need to traverse the graph using the networkx APIs.

Using the transition matrix

A graph consists of vertices and edges, where edges represent relationships between nodes and edge weights represent the strength of those relationships.

Figure 1b. Example weighted graph

This can be decomposed into an adjacency matrix. If our graph has V nodes, then our adjacency matrix will be V x V in dimension. A regular adjacency matrix has a value of 1 if an edge exists between the nodes, 0 otherwise. Since the graph edges are weighted, the values in the adjacency matrix will be the edge weights.

Figure 1c. Example weighted adjacency matrix

The adjacency matrix needs to be converted to a transition matrix, where the rows sum up to 1.0. Simply put, the transition matrix has the probability of each vertex transitioning to other vertices (thus each row summing to 1).

Figure 1d. Example transition matrix

My first attempt to calculate the transition matrix was using NumPy arrays as suggested here. It didn’t work due to memory constraints.

How can we make this more memory efficient?

Recall that the datasets have 99.99% sparsity. Of 420k electronics nodes, there are only 4 million edges. This means 420k**2 minus 4 million values are zero.

A regular NumPy array would initialize these zeros—which actually can be ignored—thus taking more memory than needed.

My next attempt involved sparse matrices (more about it here). In a nutshell, a sparse matrix doesn’t initialize those unnecessary zeros, saving memory. However, the matrix operations they support are not as extensive.

Converting the adjacency matrix to a transition matrix using sparse matrices worked. I’ll spare you the gory details.
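For illustration only (not the exact implementation), here's how the row normalization might look with scipy sparse matrices:

```python
import numpy as np
from scipy import sparse

# Toy weighted adjacency matrix (symmetric, 3 nodes)
adj = sparse.csr_matrix(np.array([[0.0, 2.0, 1.0],
                                  [2.0, 0.0, 0.0],
                                  [1.0, 0.0, 0.0]]))

# Divide each row by its sum so rows become transition probabilities
row_sums = np.asarray(adj.sum(axis=1)).ravel()
inv_row_sums = sparse.diags(1.0 / row_sums)  # assumes no isolated nodes
transition = inv_row_sums @ adj
```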

With the transition matrix, I then converted it into dictionary form for the O(1) lookup goodness. Each key would be a node, and the value would be another dictionary of the adjacent nodes and the associated probability.

With this, creating random walks is now a lot simpler and much more efficient. The next node can be picked via weighted random sampling (e.g., random.choices) over the transition probabilities. This approach is orders of magnitude faster than traversing a networkx graph (Note: It is still traversing a graph, conceptually).
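Based on the description above, a random-walk sketch over the dictionary-of-dictionaries might look like this (node names are made up):

```python
import random

def random_walk(transition, start, length):
    """transition: {node: {neighbour: probability}}."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = transition.get(walk[-1])
        if not neighbours:
            break  # dead end
        nodes = list(neighbours)
        probs = [neighbours[n] for n in nodes]
        # Weighted random choice of the next node
        walk.append(random.choices(nodes, weights=probs, k=1)[0])
    return walk
```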

Implementation 3: Node2Vec

(Implementation 3?! Where did 1 and 2 go? Read the previous post.)

Soon after I went through the pain of building my own graph and sequences, I stumbled upon the github repository of Node2Vec (“n2v”) here. Note to self: Google harder before implementing something next time.

n2v was appealing—it seems to work right out of the box. You just need to provide the edges. It would then generate the graph and sequences, and learn node embeddings. Under the hood, it uses networkx and gensim.

Unfortunately, it was very memory intensive and slow, and could not run to completion, even on a 64GB instance.

Digging deeper, I found that its approach to generating sequences was traversing the graph. If you allowed networkx to use multiple threads, it would spawn multiple processes to create sequences and cache them temporarily in memory. In short, very memory hungry. Overall, this didn’t work for the datasets I had.

For the rest of the post, we’ll explore the training of product embeddings based on the sequences generated. To give you a better idea, these sequences are in the form of a NumPy array of objects (i.e., product IDs). The dimension is N x M, where N = number of unique products x 10 (sequences per product), and M = number of nodes per sequence.

Here’s what the sequences look like.

Figure 1e. Array of sequences for electronics dataset (420k unique products)

Implementation 4: gensim.word2vec

Gensim has an implementation of w2v that takes in a list of sequences and can be multi-threaded. It was very easy to use and the fastest to complete five epochs.

It performed significantly better than matrix factorization (in the previous post), achieving an AUC-ROC of 0.9082.

However, if you examine the precision recall curve below (Fig. 2), you’ll notice the sharp cliff at around threshold of 0.73—what’s causing this?

Figure 2. Precision recall curves for gensim.word2vec (all products)

The reason: “Unseen” products in our validation set without embeddings (i.e., they didn’t appear in the training set).

Gensim.w2v is unable to initialize / create embeddings for words that don’t exist in the training data. To get around this, for product-pairs in the validation set where either product doesn’t exist in the train set, we set the prediction score to the median similarity score. It is these unseen products that cause the sharp cliff.
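The fallback can be sketched like this (the function and the toy embeddings are illustrative):

```python
import numpy as np

def score_pair(p1, p2, emb, median_score):
    """Cosine similarity if both products have embeddings,
    else fall back to the median similarity score."""
    if p1 in emb and p2 in emb:
        v1, v2 = emb[p1], emb[p2]
        return float(np.dot(v1, v2) /
                     (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return median_score

emb = {"B001": np.array([1.0, 0.0]), "B002": np.array([1.0, 0.0])}
```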

If we only evaluate validation set product-pairs constituents that appeared in the training set, the performance is significantly improved to AUC-ROC = 0.9735 (Fig 3.). Also, no more cliffs.

Figure 3. Precision recall curves for gensim.word2vec (seen products only)

Did I mention this runs in 2.58 minutes using 12 threads? This is the new baseline.

Great results in under 3 minutes—project complete, right?

Well, not really. The intent of this exercise is learning and trying new approaches. I also felt I hadn’t quite fully grokked w2v. Furthermore, with gensim, I couldn’t plot learning curves across batches and epochs.

Also, I’ve a few ideas on how to extend w2v for recsys and the vanilla gensim implementation doesn’t allow for extensions.

Therefore, let’s dive a bit deeper and implement it from scratch, in PyTorch.

Implementation 5: PyTorch word2vec

There are two main components to training a PyTorch model: The dataloader and the model. I’ll start with the more tedious dataloader.

The dataloader is largely similar to the previous implementation (for matrix factorization), with a few key differences. Firstly, instead of taking in product-pairs, it now takes in sequences.

Furthermore, there are two new features (i.e., subsampling of frequent words and negative sampling), both of which were proposed in the second w2v paper.

Subsampling of frequent words

In the paper, subsampling of frequent words (i.e., dropping out words that occur relatively frequently) was found to accelerate learning and significantly improve on learning vectors of rare words. This is fairly straightforward (an excellent explanation available here).
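Per the paper, a word is discarded with probability 1 − sqrt(t / f(w)), where t is a threshold (e.g., 1e-5) and f(w) is the word’s relative frequency. A sketch:

```python
import math
import random

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping a word, per Mikolov et al.'s
    subsampling formula: P(discard) = 1 - sqrt(t / f(w))."""
    return min(1.0, math.sqrt(t / word_freq))

def subsample(sequence, freqs, t=1e-5):
    # Randomly drop frequent words from a sequence
    return [w for w in sequence
            if random.random() < keep_probability(freqs[w], t)]
```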

Negative sampling

Explaining negative sampling is significantly more involved, so bear with me.

The original skip-gram had a (hierarchical) softmax layer at the end, where it outputs the probability of all vocabulary words being in the neighbourhood of the centre word.

If you had a vocabulary of 10k words (or products), that would be a softmax layer with 10k units. With an embedding dimension of 128, this means 1.28 million weights to update—that’s a lot of weights. This problem is more pronounced in recsys, as the “vocabulary” of products is in the millions.

Negative sampling only modifies a small proportion of weights. Specifically, the positive product-pair and a small sample of negative product-pairs. Based on the paper, five negative product-pairs is sufficient for most use cases.

If we have five negative product-pairs, this means we only need to update six output neurons (i.e., 1 positive product-pair, 5 negative product-pairs). Assuming 1 million products, that’s 0.0006% of the weights—very efficient!

(Note: You might have noticed that negative sampling was also adopted in the matrix factorization approach in the previous post, where five negative product-pairs were generated for each positive product-pair during training.)

How are these negative samples selected?

From the paper, they are selected using a unigram distribution, where more frequent words are more likely to be selected as negative samples. One unusual trick was to raise the word counts to the ¾ power, which they found to perform the best. This implementation does the same.
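A toy sketch of that sampling distribution (the counts are made up):

```python
import numpy as np

counts = np.array([100.0, 25.0, 1.0])  # toy word/product counts
probs = counts ** 0.75                 # raise counts to the 3/4 power
probs = probs / probs.sum()            # normalize to a distribution

rng = np.random.default_rng(42)
negatives = rng.choice(len(counts), size=5, p=probs)  # 5 negative samples
```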

Skipgram model

For the model, the PyTorch w2v implementation is very straightforward and less than 20 lines of code, so I won’t go into explaining the details. Nonetheless, here’s some simplified code of what the skipgram model class looks like:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, emb_size, emb_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)
        self.context_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)

    def forward(self, center, context, neg_context):
        emb_center = self.center_embeddings(center)
        emb_context = self.context_embeddings(context)
        emb_neg_context = self.context_embeddings(neg_context)

        # Get score for positive pairs
        score = torch.sum(emb_center * emb_context, dim=1)
        score = -F.logsigmoid(score)

        # Get score for negative pairs
        neg_score = torch.bmm(emb_neg_context, emb_center.unsqueeze(2)).squeeze()
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)

        # Return combined score
        return torch.mean(score + neg_score)

How does it perform relative to matrix factorization?

Considering all products, the AUC-ROC achieved was 0.9554 (Fig. 4, significantly better than gensim). If only considering seen products, the AUC-ROC was 0.9855 (Fig. 5, slightly better than gensim).

Figure 4. Precision recall curves for PyTorch word2vec (all products)
Figure 5. Precision recall curves for PyTorch word2vec (seen products only)

The results achieved here are also better than an Alibaba paper that adopted a similar approach, also on an Amazon electronics dataset. The paper reported an AUC-ROC of 0.9327.

When examining the learning curve (Fig. 6), it seems a single epoch is sufficient. What stands out is that each time the learning rate is reset (with each new epoch), the AUC-ROC doesn’t drop drastically as in matrix factorization.

Figure 6. AUC-ROC across epochs for word2vec; a single epoch seems sufficient

Great result overall—it’s able to replicate gensim.word2vec and also does better.

Furthermore, it allows us to initialize embeddings for all products, even those not present in the train set. While those embeddings may be untrained, they can be updated with new data (without having to retrain from scratch).

One downside—it’s nowhere as efficient as the gensim implementation, taking 23.63 minutes. I blame my suboptimal implementation. (Suggestions on how to improve welcome!).

Implementation 6: PyTorch word2vec with side info

One reason why I built w2v from scratch is the intention of extending it by adding side information. For each product in a sequence, we have valuable side information such as brand, category, price, etc. (example below). Why not add this information when learning embeddings?

B001T9NUFS -> B003AVEU6G -> B007ZN5Y56 ... -> B007ZN5Y56
Television    Sound bar     Lamp              Standing Fan
Sony          Sony          Phillips          Dyson
500 – 600     200 – 300     50 – 75           300 - 400

Implementation-wise, instead of a single embedding per product, we now have multiple embeddings for product ID, brand, category, etc. We can then aggregate these embeddings into a single embedding.
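Sketching the equal-weighted version in PyTorch (the sizes and names here are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

n_products, n_brands, n_categories, dim = 1000, 50, 20, 128
product_emb = nn.Embedding(n_products, dim)
brand_emb = nn.Embedding(n_brands, dim)
category_emb = nn.Embedding(n_categories, dim)

def combined_embedding(product_id, brand_id, category_id):
    # Equal-weighted average of the product, brand, and category embeddings
    stacked = torch.stack([product_emb(product_id),
                           brand_emb(brand_id),
                           category_emb(category_id)])
    return stacked.mean(dim=0)

emb = combined_embedding(torch.tensor([0]), torch.tensor([3]), torch.tensor([7]))
```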

This could potentially help with the cold-start problem; it was also proposed in the Alibaba paper where they used side information for brand, category level 1, and category level 2. By using a similar electronics dataset from Amazon, they reported AUC-ROC improvement (due to side information) from 0.9327 to 0.9575.

I implemented two versions. First, equal-weighted averaging (of embeddings) and second, learning the weightage for each embedding and applying a weighted average to aggregate into a single embedding.

Both didn’t work. AUC-ROC during training dropped to between 0.4 – 0.5 (Fig. 7).

Figure 7. AUC-ROC across epochs for word2vec with side information

After spending significant time ensuring the implementation was correct, I gave up. Trying it with no side information yielded the same results as implementation 5, albeit slower.

One possible reason for this non-result is the sparsity of the meta data. Out of 418,749 electronic products, we only had metadata for 162,023 (39%). Of these, brand was 51% empty.

Nonetheless, my assumption was that the weightage of the embeddings, especially the (less helpful) metadata embeddings, could be learned. Thus, their weight in the aggregate would be reduced (or minimized), and adding metadata should not hurt model performance. Clearly, this was not the case.

All in all, trying to apply w2v with side information didn’t work. ¯\_(ツ)_/¯

Implementation 7: Sequences + Matrix Factorization

Why did the w2v approach do so much better than matrix factorization? Was it due to the skipgram model, or due to the training data format (i.e., sequences)?

To understand this better, I tried the previous matrix factorization with bias implementation (AUC-ROC = 0.7951) with the new sequences and dataloader.

Lo and behold, AUC-ROC shot up to 0.9320 (Fig. 8)!

Figure 8. Precision recall curve for PyTorch MF-bias with sequences.

This suggests that the “graph-random-walk-sequences” approach works well.

One reason could be that in the original matrix factorization, we only learnt embeddings based on the product-pairs. But in the sequences, we used a window size of 5, thus we had 5 times more data to learn from, though products that are further apart in the sequences might be less strongly related.

Oddly though, the matrix factorization approach still exhibits the effect of “forgetting” as learning rate resets with each epoch (Fig 9.), though not as pronounced as Figure 3 in the previous post.

I wonder if this is due to using the same embeddings for both center and context.

Figure 9. Learning curve for PyTorch MF-bias with sequences.

Another downside is that it takes about 3x the time, increasing from 23.63 minutes (w2v implementation) to 70.39 minutes (matrix factorization implementation).

Additional results on a bigger dataset (books)

In parallel, I also prepared a books dataset that has 2 million unique products making up 26.5 million product-pairs and applied the implementations on it.

Some notable results are as follows.

Matrix Factorization:

  • Overall AUC-ROC: 0.4996 (not sure why it’s not able to learn)
  • Time taken for 5 epochs: 1353.12 minutes

Gensim Word2vec

  • Overall AUC-ROC: 0.9701
  • Seen-products-only AUC-ROC: 0.9892
  • Time taken for 5 epochs: 16.24 minutes

PyTorch Word2vec

  • Overall AUC-ROC: 0.9775
  • Time taken for 5 epochs: 122.66 minutes

PyTorch Matrix Factorization with Sequences

  • Overall AUC-ROC: 0.7196
  • Time taken for 5 epochs: 1393.08 minutes

Similarly, using sequences with matrix factorization helps significantly, though it doesn’t quite achieve the same stellar results as regular word2vec.

Further Extensions

In these two blog posts, we’ve examined 7 different item-to-item recsys implementations. And we haven’t even considered user information yet—the exploration space is huge!

If we have user data, we can build user embeddings in the same vector space (as product embeddings). This can be done by training user embeddings based on the products that (i) users click on (positive), (ii) users are exposed to but don’t click on (negative), and (iii) purchase (strong positive).

This approach was adopted by Airbnb and seems promising. However, due to sparsity of user and product data (i.e., majority of users and accommodations have very few bookings), they aggregated users and accommodations into user and accommodation types—this ensures sufficient samples for each type.

Why stop at just user and product embeddings? From this exercise, we’ve seen how powerful embeddings can be with the right training data. In the StarSpace paper, Facebook has taken it to the extreme and proposed embedding everything.

Recently, Uber Eats also shared about how they used Graph Neural Networks to learn embeddings for users and menu items, where a node’s embedding is based on its neighbours. More here.

Key Takeaways

Using w2v to generate product embeddings is a very strong baseline and easily beats basic matrix factorization approaches.

If you have the sequences ready, you can just use gensim’s implementation. If you want to extend on w2v and need your own implementation, developing it is not difficult.

It was great that the PyTorch w2v implementation did better than gensim’s, and had better results than the Alibaba paper. Unfortunately, I was unable to replicate the improvements from using side information. Perhaps I need to rerun the experiments on a dataset where the metadata is less sparse to confirm.

Lastly, training data in the form of sequences are epic—matrix factorization with sequences is far better than just matrix factorization with product-pairs. Sequences can be built via multiple novel approaches—in this post, I built a product graph and then performed random walks to generate those sequences.


This has been a fun exercise to apply my interest in sequences to an area of machine learning where they are less commonly seen (i.e., recsys).

I hope the learnings and code shared here will be useful for others who are developing their own recsys implementations, be it for fun or at work.


This article is a reproduction of the original by Eugene Yan. The support of local content creators is part of AI Singapore’s broader mission of building up the nation’s AI ecosystem. If you know of content worth sharing, email us at

Eugene has worked in leading data science positions in Lazada (acquired by Alibaba). He is also an active member of the data science community in Singapore, mentoring and sharing his experiences at various events.

Recommender Systems : Beyond the user-item matrix

(By Eugene Yan, republished with permission)


Recommender systems (“recsys”) are a fairly old topic that started back in the 1990s. The key idea — we can use opinions and behaviours of users to suggest and personalize relevant and interesting content for them.

Adding yet another post on the standard item-user collaborative filtering wouldn’t contribute much to the hundreds, if not thousands, of posts available.

Thankfully, this is not one of those.


A web search on recommender systems surfaces articles on “collaborative filtering”, “content-based”, “user-item matrix”, etc.

Since then, there has been much progress applying newer techniques for recsys. Unfortunately, easily readable articles or blogs on them are few and far between.

Thus, in this pair of articles, I’ll be sharing about seven different implementations of recsys.

The articles cover the end-to-end process, from data acquisition and preparation to (classic) matrix factorization, as well as techniques from graphs and natural language processing. A comparison of results from the different approaches on real-world data will also be discussed (I hope you love learning curves).

Data Acquisition

We can’t build a recsys without data (well, we can’t build most machine learning systems without data).

Thankfully, Professor Julian McAuley, Associate Professor at UC San Diego has a great collection of data sets (e.g., product reviews, meta data, social networks). For this post, we’ll be using the Amazon dataset (May 1996 – July 2014), specifically, the electronics and books datasets.

Parsing json

All of the datasets are in json format and require parsing into a format that’s easier to work with (e.g., tabular). Some of these datasets can be pretty large, with the largest having 142.8 million rows and requiring 20 GB on disk.

Here’s an example of what a json entry for a single product looks like (what we’re interested in is the related field).

{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "",
  "related": {
    "also_bought": [...],
    ...
  },
  "salesRank": { "Toys & Games": 211836 },
  "brand": "Coxlures",
  "categories": [
    [ "Sports & Outdoors",
      "Other Sports",
      ... ]
  ]
}

Needless to say, we’re not going to be able to load this fully into RAM on a regular laptop with 16GB RAM (which I used for this exercise).

To parse the json to csv, I iterated through the json file row by row, converted the json into comma-delimited format, and wrote it out to CSV. (Note: This requires that we know the full keys of the json beforehand).

At the end, we have product metadata in nice tabular format. This includes columns for title and description, brand, categories, price, and related products.

Getting product relationships

To get the product relationships, we need the data in the “related” field. This contains relationships on “also bought”, “also viewed”, “bought together”, etc.

Extracting this requires a bit of extra work as it is contained in a dictionary of lists. Did I mention that it’s also in string format?

To get the relationships, we take the following steps:

  • Evaluate the string and convert to dictionary format
  • Get the associated product IDs for each relationship
  • Explode it so each product-pair is a single row with a single relationship

In the end, what we have is a skinny three column table, with columns for product 1, product 2, and relationships (i.e., also bought, also viewed, bought together). At this point, each product pair can have multiple relationships between them.

Here’s a sample:

product1   | product2   | relationship
B001T9NUFS | B003AVEU6G | also_viewed
0000031895 | B002R0FA24 | also_viewed
B007ZN5Y56 | B005C4Y4F6 | also_viewed
0000031909 | B00538F5OK | also_bought
B00CYBULSO | B00B608000 | also_bought
B004FOEEHC | B00D9C32NI | bought_together

Converting relationships into a single score

Now that we have product-pairs and their relationships, we’ll need to convert them into some kind of single score/weight.

For this exercise, I’ll simplify and assume the relationships are bidirectional (i.e., symmetric in direction). (Note: Product-pair relationships can be asymmetric; people who buy phones would also buy a phone case, but not vice versa).

How do we convert those relationship types into numeric scores?

One simple way is to assign 1.0 if a product-pair has any (or multiple) relationships, and 0.0 otherwise.

However, this would ignore the various relationship types—should a product pair with an “also bought” relationship have the same score as one that has an “also viewed” relationship?

To address this, weights were assigned based on relationships (bought together = 1.2, also bought = 1.0, also viewed = 0.5) and summed across all unique product pairs. This results in a single numeric score for each product pair (sample below).

product1   | product2   | weight
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.0
B004FOEEHC | B00D9C32NI | 1.2
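The weighting and summing above can be sketched with pandas (the column names and toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "product1":     ["A", "A", "B"],
    "product2":     ["B", "B", "C"],
    "relationship": ["also_viewed", "also_bought", "bought_together"],
})

weights = {"bought_together": 1.2, "also_bought": 1.0, "also_viewed": 0.5}
df["weight"] = df["relationship"].map(weights)

# Sum across all relationships for each unique product-pair
scores = df.groupby(["product1", "product2"], as_index=False)["weight"].sum()
```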

For the electronics dataset, there’re 418,749 unique products and 4,005,262 product-pairs (sparsity = 1 − 4 million / 420k² = 0.9999). For books, we have 1,948,370 unique products, and 26,595,848 product pairs (sparsity = 0.9999).

Yeap, recommendation data is very sparse.

Train-validation split

The data sets were randomly split into 2/3 train and 1/3 validation. This can be done with sklearn’s train_test_split or random indices. Easy-peasy.
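For example (a sketch with toy pairs):

```python
from sklearn.model_selection import train_test_split

pairs = [(f"P{i}", f"P{i + 1}") for i in range(9)]  # toy product-pairs
train, val = train_test_split(pairs, test_size=1 / 3, random_state=42)
```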

Done, amirite?

Unfortunately, not.

You might have noticed that our dataset of product-pairs consists only of positive cases. Thus, for our validation set (and the training set during training time), we need to create negative samples.

Creating negative samples in an efficient manner is not so straightforward.

Assuming we have 1 million positive product-pairs, to create a validation set with 50:50 positive-negative pairs would mean 1 million negative samples.

If we randomly (and naïvely) sample from our product pool to create these negative samples, it would mean random sampling 2 million times. Calling random that many times takes very long—believe me, I (naïvely) tried it.

To get around this, I applied the hack below:

  • Put the set of products in an array
  • Shuffle the array
  • For each iteration, take a slice of the array at iteration x 2
    • If iteration = 0, product-pair = array[0: 0+2]
    • If iteration = 1, product-pair = array[2: 2+2]
    • If iteration = 10, product-pair = array[20: 20+2]
  • Once the array is exhausted, (re)shuffle and repeat

This greatly reduced the amount of random operations required and was about 100x faster than the naïve approach.
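The shuffle-and-slice hack above might look something like this (a sketch; the actual implementation differs):

```python
import random

def negative_pairs(products, n_pairs, seed=42):
    """Create negative product-pairs by shuffling once and slicing,
    instead of calling a random sampler for every pair."""
    rng = random.Random(seed)
    pool = list(products)
    rng.shuffle(pool)
    pairs, i = [], 0
    while len(pairs) < n_pairs:
        if i + 2 > len(pool):
            rng.shuffle(pool)  # array exhausted: (re)shuffle and repeat
            i = 0
        pairs.append((pool[i], pool[i + 1]))
        i += 2
    return pairs

pairs = negative_pairs([f"P{i}" for i in range(10)], 7)
```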

Collaborative filtering (how it is commonly known)

Collaborative filtering uses information on user behaviours, activities, or preferences to predict what other users will like based on item or user similarity. In contrast, content filtering is based solely on item metadata (i.e., brand, price, category, etc.).

Most collaborative filtering articles are on user-item collaborative filtering (more specifically, matrix factorization). And there’s always an (obligatory) user-item matrix (Fig. 1).

Figure 1. Cliché representation of collaborative filtering

However, if we don’t have user data (like in this case), can this still work?

Why not?

Instead of a user-item matrix, we have an item-item matrix. We then learn latent factors for each item and determine how strongly they’re related to other items.

One way to do this is to load the entire item-item matrix (in memory) and apply something like Python’s surprise or SparkML’s Alternating Least Squares.

However, this can be very resource intensive if you have to load the entire dataset into memory or apply some distributed learning approach.

Our datasets are very sparse (99.99% empty). Can we make use of the sparsity to make it more resource efficient? (Hint: Yes, we can.)

Implementation 1: Matrix Factorization (iteratively pair by pair)

One way to reduce the memory footprint is to perform matrix factorization product-pair by product-pair, without fitting it all into memory. Let’s discuss how to implement this in PyTorch.

First, we load the product-pairs (just the pairs, not the entire matrix) into an array. To iterate through all samples, we just need to iterate through the array.

Next, we need to create negative product-pair samples. Here’s the approach taken to generate negative samples:

  • For each product pair (e.g., 001-002), we take the product on the left (i.e., 001) and pair it with n (e.g., five) negative product samples to form negative product-pairs
  • The negative samples are based on the set of products available, with the probability of each product being sampled proportional to their frequency of occurrence in the training set

Thus, each (positive) product-pair (i.e., 001-002) will also have five related negative product-pairs during training.

Continuous labels

For each product-pair, we have numeric scores based on some weightage (bought together = 1.2, also bought = 1.0, also viewed = 0.5). Let’s first train on these continuous labels. This is similar to working with explicit ratings.

Binary labels

To do it in the binary case (such as with implicit feedback), actual scores greater than 0 are converted to 1. Then, a final sigmoid layer is added to convert the score to between 0 – 1. Also, we use a loss function like binary cross entropy (BCE).


Regularization—no doubt it’s key in machine learning. How can we apply it?

One approach is adding L2 regularization to ensure embedding weights don’t grow too large. We do this by adding a cost of sum(embedding.weights ** 2) * C (regularization param) to the total loss.

At a high level, for each pair:

  • Get the embedding for each product
  • Multiply embeddings and sum the resulting vector (this is the prediction)
  • Reduce the difference between predicted score and actual score (via gradient descent and a loss function like mean squared error or BCE)

Here’s some pseudo-code on how iterative matrix factorization would work:

for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)

    # Predict product-pair score (interaction term and sum)
    prediction = sigmoid(sum(product1_emb * product2_emb, dim=1))
    l2_reg = reg_lambda * sum(embedding.weight ** 2)

    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss += l2_reg

    # Update embeddings via gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


Training Schedule

For the training schedule, we run it over 5 epochs with cosine annealing. For each epoch, the learning rate starts high (0.01) and drops rapidly to a minimum value near zero, before being reset for the next epoch (Fig. 2).

Figure 2. Cosine Annealing learning rate schedule
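In PyTorch, this schedule can be sketched with CosineAnnealingLR (the model and values below are illustrative stand-ins):

```python
import torch

model = torch.nn.Embedding(10, 4)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Anneal the learning rate from 0.01 towards ~0 over one epoch's batches;
# re-creating the scheduler each epoch resets the rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)

for batch in range(100):
    optimizer.step()   # (gradient computation omitted in this sketch)
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]
```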


For the smaller electronics dataset (420k unique products with 4 million pairs), it took 45 minutes for 5 epochs—doesn’t seem too bad.

Training on binary labels (1 vs. 0), we get a ROC-AUC of 0.8083. Not bad considering that random choice leads to ROC-AUC of 0.5.

From the learning curve (Fig. 3) below, you’ll see that one epoch is sufficient to achieve close to optimal ROC-AUC. Interestingly, each time the learning rate is reset, the model seems to need to “relearn” the embeddings from scratch (AUC-ROC drops to near 0.5).

Figure 3. AUC-ROC across epochs for matrix factorization; Each time learning rate is reset, the model seems to ”forget”, causing AUC-ROC to revert to ~0.5. Also, a single epoch seems sufficient.

However, if we look at the precision-recall curves below (Fig. 4), you see that at around 0.5 we hit the “cliff of death”. If we estimate the threshold slightly too low, precision drops from close to 1.0 to 0.5; slightly too high and recall is poor.

Figure 4. Precision recall curves for Matrix Factorization (binary labels).

Using the continuous labels, performance seems significantly better (AUC-ROC = 0.9225) but we see the same cliff of death (Fig. 5). Nonetheless, this does suggest that explicit (i.e., scale) ratings can perform better than implicit ratings.

Figure 5. Precision recall curves for Matrix Factorization (continuous labels).

This baseline model only captures the relationship within each product-pair. But what if a product is generally popular or unpopular?

We can add biases. Implementation-wise, a bias is a single learned number per product.
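As a sketch, the pair score with biases might look like this in plain Python (a hypothetical helper; the actual model adds learned bias parameters to the batched tensor computation):

```python
import math

def predict_pair_score(emb1, emb2, bias1, bias2):
    """Sigmoid over the embeddings' dot product plus the two per-product biases."""
    dot = sum(a * b for a, b in zip(emb1, emb2))
    return 1 / (1 + math.exp(-(dot + bias1 + bias2)))
```

A generally popular product gets a positive bias that lifts all its pair scores, independent of the interaction term.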

Implementation 2: Matrix Factorization with Bias

If we just observe the AUC-ROC metric, adding bias doesn’t seem to help: AUC-ROC decreases from 0.8083 to 0.7951 on binary labels, and from 0.9335 to 0.8319 on continuous labels.

However, if we examine the precision-recall curves, adding bias reduces the steepness of the curves where they intersect, making the model more production-friendly (i.e., easier to deploy with a fixed threshold). Here are the curves for both binary (Fig. 6) and continuous labels (Fig. 7).

Figure 6. Precision recall curves for Matrix Factorization + bias (binary labels).
Figure 7. Precision recall curves for Matrix Factorization + bias (continuous labels).

It may seem alarming that AUC-ROC on continuous labels drops sharply (0.9335 to 0.8319), but it shouldn’t be—it’s an artefact of how AUC-ROC is measured.

In the ROC curve for the continuous labels without bias (Fig. 8a), false positives only start to occur after about 75% of true positives have been identified. Beyond the threshold of 0.5, all the false positives are introduced (i.e., the precision-curve cliff of death in Fig. 5). The ROC curve and the AUC-ROC metric don’t make this very observable, so the AUC-ROC appears significantly better (but it really isn’t).

Figure 8a (left) and 8b (right). ROC curves with and without bias.

On the larger books dataset, it took about 22 hours for 5 epochs—this approach seems to scale more than linearly with the number of product-pairs (i.e., it scales badly). Unfortunately, the AUC-ROC stays around 0.5 during training.

Key takeaways

The (iterative) matrix factorization approach seems to do okay for a baseline, achieving decent AUC-ROC of ~0.8.

While the baseline approach (without bias) does well, it suffers from sharp cliffs on the precision curve. Adding bias improves on this significantly.

More importantly, we learnt that one shouldn’t just look at numeric metrics to understand how a machine learning model is doing. Plotting the curves can be very useful to better understand how your model performs, especially if it’s classification-based and you need to arbitrarily set some threshold.

What’s next?

Wait, isn’t this still matrix factorization, albeit in an iterative, less memory-intensive form? Where are the graph and NLP approaches? You tricked me!

Yes, the current post only covers the baseline, albeit in an updated form.

Before trying any new-fangled techniques, we have to first establish a baseline, which is what this post does.

In the next post, we’ll apply graph and NLP techniques to the problem of recommender systems.


This article is a reproduction of the original by Eugene Yan. The support of local content creators is part of AI Singapore’s broader mission of building up the nation’s AI ecosystem. If you know of content worth sharing, email us at

Eugene has worked in leading data science positions in Lazada (acquired by Alibaba). He is also an active member of the data science community in Singapore, mentoring and sharing his experiences at various events.

The AIAP™ Journey : A Sharing from Batch 3

The nine-month journey of the AI Apprenticeship Programme (AIAP)™ ended on 23 December, 2019 for eighteen new graduates from the third iteration of the deep skilling programme. It was a packed 40-week period of their lives, but for many of them, becoming an AI engineer began a long time before that.

I was interested in AI during my final 2 years in university. When I went for interviews for actuarial positions, I was consistently asked if I had AI skills. I reckoned that meant AI skills would be in huge demand in any industry in the near future.

– Xin Jie (graduate in actuarial science and statistics)

Some were fresh out of school and others had been out in the workforce for many years. What they shared was the belief that AI as a technology will become even more pervasive and has untapped potential to improve lives. What could be more exciting than actually having the chance to work on real-world data with real-world organisations to implement and deploy AI solutions?

A Community of Self-Starters

Joining AIAP™ meant being thrown into a sea of learning. It also meant being part of a community of folks who relish charting out their own personal learning journeys. What are the deliverables? How do they help the client? What relevant knowledge is required? What is the best way to acquire it? What should be prioritised? Apprentices in the programme are not expected to sit around waiting for tasks to be distributed. In fact, they play an active role in defining what needs to be done during stand-up meetings, which typically take place every other day in an Agile manner.

The programme offers a conducive learning environment with lots of self-directed learners who plan their own learning paths. If this learning mode matches well with you, this place is ideal! Else, this may be misinterpreted as a system lacking in structure and proper direction.

– Siew Lin

Of course, the apprentices did not simply work alone. Each project team, which consisted of two or three apprentices, was closely accompanied throughout the entire journey by a mentor who offered technical guidance and helped the apprentices to hone their critical thinking and problem solving skills. They were also supported by the steady and experienced hands of project managers who ensured that important considerations were not overlooked even as they buried themselves deep in the technical details for the best solutions.

Excellent mentorship makes working a breeze.

– Han Qi

As we enter the decade of the 2020s, work is becoming ever more knowledge-based. Lifelong learning has become an unavoidable part of one’s career. At current rates of knowledge generation, classroom-style learning no longer suffices. Network-based community learning will become commonplace. It was against this backdrop that the apprenticeship programme was conceived. Although the apprentices might be working on different projects in different industries, they were all housed under one roof which cultivated the development of personal relationships and catalysed the exploration of ideas along the hallways. These are important factors that make a vibrant technical community.

There was always someone around who could provide advice or technical help whenever we encountered a problem. Being in an environment where there are multiple teams working on different projects allowed us to bounce ideas off one another very easily.

– Benedict

Hardcore Techies

Git repo, EDA, PCA, Polyaxon, NER, CI/CD, unit test, Docker … these are but a small sample of the terms used in the trade that everyone in the programme became familiar with. There is no escape from the fact that the AI engineer has many concepts and techniques to grasp and the learning never stops.

The world of AI in the year 2019 was dominated by interest in transformer-based language models. The apprentices also took note of this development and self-initiated a series of sessions to read through and share insights on academic papers related to the subject, outside of their usual activities.

Every now and then, my peers are learning about the newest ML (machine learning) paper and sharing them. In a way, we are influenced by colleagues to keep up with the current trend.

– Kian Tian

Soft Skills are important too

While hard technical skills are often lauded, the human side of things cannot be neglected. At the personal level, it meant the ability to soldier on when things do not go as planned.

Resilience in experimenting with solutions; keep trying until a winning idea is adopted!

– Ken

Doing an AI project also meant participation in a team sport. Different players on the team have different mindsets and it is vital to bridge them.

Explaining difficult concepts to clients in layman terms is an important skill for technical communication since many clients (i.e. stakeholders and decision-makers) are often non-technical.

– Shao Xiang

With so many experiences gained in the preceding nine months, it is understandably difficult to convey the full takeaways in a single sentence. Perhaps the following quote comes close.

Too many good things to list – great colleagues, great opportunity, great brand name – all these allowed me to develop my data science skills rapidly and helped me get several offers before my graduation.

– Meraldo

The AIAP™ is the first TechSkills Accelerator Company-Led Training (TeSA-CLT) initiative in AI. This is a collaboration between AI Singapore and IMDA to develop a pipeline of AI professionals for the industry.

Application for the sixth batch of apprentices will open on 3 February, 2020. Click here to find out more.

Related Stories

AIAP™ Batch 3 Graduates!

On 23 December, 2019, we witnessed the graduation of the third batch of apprentices from the AI Apprenticeship Programme (AIAP)™. In his keynote address, Director of AI Industry Innovation, Laurence Liew, thanked the new graduates for their confidence in the programme and for taking the leap of faith to join it full-time, which meant giving up a full-time job for many of them. Their foresight has in fact been affirmed by the recognition of AIAP™ as one of the country winners in the “Talent Accelerator” category in the IDC Digital Transformation Awards (DX Awards) late last year. Much has been learned during the apprenticeship and the spirit of never-stop-learning should never waver, for it was an important criterion for selection into the programme in the first place.

Kevin Lee, one of the more senior graduates and a 25-year veteran of the IT industry, also took to the stage to share his personal thoughts about the preceding nine months of the apprenticeship. Calling it a period of tremendous learning, he was thankful for having had the chance to engage in the end-to-end experience of working on a real-world business problem, translating it into an AI project and deploying the solution in a corporate environment. Through it all, he had been able to experience first-hand the good, the bad and the ugly that are present on the journey of creating value with AI in real-life. Apart from the technical skills acquired, he was impressed with the “can-do” attitude of his younger peers. He shared a particular anecdote about the time he was invited to team up for a hackathon and coding challenge. Looking at the event’s problem description, he questioned whether anyone really knew how to tackle it. The reply was much food for thought for him.

No, we don’t. But we’ll figure it out! ✌️

It was an epiphany of sorts for Kevin:

When we were young, in our twenties, we had no fear. We took on challenges… absorb new knowledge like a sponge. But as we age, especially when we are successful in our careers, the more we habituate to our comfort zones. Once we get in there, we will be reluctant to come out… take risks and challenges.

But take on risks and new challenges we must, as in this day and age, learning and re-learning is a mantra that sounds ever louder by the day.

Coming in at the other end of the spectrum, Benedict Lee, who recently graduated from university, recalled how intense the start of the apprenticeship had been. For the first two months, the apprentices had to undergo common training, doing assignments based on important concepts across the broad field of AI, before they were assigned to their projects. The pace was breakneck and, as is often the case in working life, there were new concepts to take in before the current ones had been fully digested.

What really helped, in Benedict’s opinion, was the supportive culture in the programme. Everyone was open to discuss ideas, and simply turning to the person next to you was often enough to resolve or clarify many things.

There had been a lot to learn, a lot to do and quite a bit of fun.

Having successfully completed the gruelling 9-month apprenticeship programme at the end of 2019, we wish our graduates and newly-minted AI engineers success in 2020 and beyond!

The AIAP™ is the first TechSkills Accelerator Company-Led Training (TeSA-CLT) initiative in AI. This is a collaboration between AI Singapore and IMDA to develop a pipeline of AI professionals for the industry.

Application for the sixth batch of apprentices will open on 3 February, 2020. Click here to find out more.

Related Stories

Machines of Loving Grace by John Markoff : Book Review

I came across this book at Kinokuniya and decided to purchase the audio book. What caught my eye and interest was the subtitle “The Quest for Common Ground Between Humans and Robots”. As an economist and a researcher, I am very interested to know more about the future state of the economy and society as we build better artificial intelligence. This is a theme I previously touched on in my earlier book reviews The AI Economy & Data for the People.

The book was written with only two keywords in mind – Artificial Intelligence and Intelligence Augmentation. The first refers to the creation of intelligence similar to that of humans, capable of taking over tasks done by them, while the second describes the enhancement of human capabilities, like what computers and smartphones do.

John Markoff, a reporter at The New York Times who writes for its science section, did a lot of research for this book. In it, he covers many areas of artificial intelligence research, like symbolic AI vs connectionism, robotics, driverless cars, etc. He also provides a rich history of the people, past and present, active in these different areas – the prominent researchers and thought leaders – turning the whole book into a who’s who.

For a researcher in artificial intelligence, it is always good to revisit the history, to understand the thought processes of these leaders, extrapolate to where we are going next, and perhaps notice some of the current weaknesses on the research and engineering fronts. It is a great book for anyone in this field who would like a list of the thought leaders in the different areas of artificial intelligence and wishes to learn more about whether we are going down the path of artificial intelligence or intelligence augmentation.


The audio book broadened my horizons on how earlier researchers and thought leaders saw where artificial intelligence research should go, and it definitely gave me more material to research. My rating is 4.5 out of 5 stars.

Agile AI Engineering in AISG

At AI Singapore, we are constantly running several 100 Experiments (100E) projects. To ensure smooth and successful delivery, it is essential to have some form of structured methodology to help guide the team. For this, we adopt well-known principles of Lean and Agile development; common frameworks used include Scrum, Kanban and others.

While such practices are prevalent among mature development teams, one of the key challenges we face is that our apprentices come from diverse backgrounds and tend not to have any prior experience in software or AI development. They likely have not worked in a team-based setting which requires close communication. Therefore, easing them into such an environment and helping them to understand the benefits are part of an ongoing process which we are practicing and adapting as well.

Scrum is a popular framework for implementing Agile practices and, as this was the concept best understood by the founding team, adopting it to kickstart the 100E projects was the natural thing to do. However, after experimenting with it for a while, we found that the rigidity of a fixed planning cycle was not a particularly good fit for AI development, where current tasks can very often uncover new information that affects previously planned tasks. To work around this, some teams switched to a shorter weekly planning cycle instead of the usual 2- or 3-week cycle. The change proved useful and resulted in less churn of planned tasks, as we plan over a shorter horizon and focus on answering current questions.

The typical work involved in AI development can be grouped and roughly visualised in the following diagram.

At any point in time, the engineers and apprentices may be working on several of these. EDA (Exploratory Data Analysis) is where the bulk of the uncertainty lies. It tends to revolve around the quality of the data at hand and often surfaces during the initial phases of the project, when the data is not yet well understood. To a lesser extent, a fair amount of uncertainty also surfaces during the modelling and model-debugging phase. Frequently, there is a need for further research when the initial approach does not yield good results or has simply been stretched to its full potential. That aside, more typical engineering work, such as building out the data pipeline, creating the scaffolding required to run our modelling experiments, and building out the CI/CD pipeline, is much more certain and amenable to planning, as in usual software engineering projects. In many instances, planning out the tasks revolves around balancing these two kinds of work.

To enable fast iteration of model building and experimentation, teams are encouraged to quickly build out their AI pipelines so that they can easily ingest the required data and test out different modelling approaches. To achieve this, we rely on tools such as a good CI/CD platform (e.g., Gitlab) and an experiment management platform in the form of Polyaxon. A reliable CI/CD pipeline enables us to push changes out for deployment as and when required, with the safe knowledge that the built-in sanity checks and tests will help catch any code issues early. On the model development side, Polyaxon allows us to run our experiments, which include model training, hyperparameter tuning and evaluation. Metrics tracking is another helpful, if basic, feature that Polyaxon provides.

Another common challenge we face in AI development is data versioning. This is still an evolving area and we are starting to experiment with tools such as DVC and Kedro. Because the types of data used vary across our projects, we recommend choosing the data versioning tool that best suits each project’s data, as each tool has its own strengths and weaknesses.

Compared to data versioning, model versioning is a much easier proposition, as existing practices for versioning software artifacts are usually adequate. In this case, a good CI/CD pipeline with a proper artifact publishing mechanism that tags the artifacts appropriately is usually sufficient for the project’s needs.

Coming up next, we aim to introduce a more test-driven approach to building our AI pipelines. The challenge is to balance experimental code that lives in notebooks against production-ready pipelining code. To avoid unnecessarily frequent code reworks, the core part of the pipeline code should be relatively stable, which then allows us to write more meaningful tests that correctly assert its functionality. As we explore this space, frameworks such as Kedro are helpful, as they enable us to write our transformation functions in isolation and chain them together to form the desired pipeline.
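The idea of isolated, testable transformation functions chained into a pipeline can be sketched in plain Python (hypothetical functions and field names; Kedro’s node and pipeline abstractions provide a much richer version of this):

```python
def clean(records):
    """Drop records with any missing values."""
    return [r for r in records if None not in r.values()]

def add_total(records):
    """Add a derived 'total' field to each record."""
    return [{**r, "total": r["price"] * r["qty"]} for r in records]

def run_pipeline(records, steps):
    """Chain transformation steps; each step is unit-testable in isolation."""
    for step in steps:
        records = step(records)
    return records
```

Because each step takes and returns plain data, each can be asserted on independently before being composed, which is exactly what makes a test-driven pipeline practical.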

Finally, going full Kanban, where we drop fixed planning sessions in favour of more ad hoc design sessions, is also on the cards. As we embark on a brand new year, I’m looking forward to continually improving our engineering practices. If all this sounds exciting to you, drop by and have a chat with us or leave a comment. We are always on the lookout for talent to help us build the AI ecosystem and bring AI development in Singapore to a higher level.
