AI Blueprints: Implementing content-based recommendations using Python

This is an excerpt from Packt’s latest book, AI Blueprints written by Dr. Joshua Eckrot.

In this article, we’ll have a look at how you can implement a content-based recommendation system using Python and the scikit-learn library. But before diving straight into this, it’s important to have some prerequisite knowledge of the different ways by which recommendation systems can recommend an item to users.

There are two ways to recommend an item:

Content-based: A content-based recommendation finds similar items to a given item by examining the item’s properties, such as its title or description, category, or dependencies on other items (for example, electronic toys require batteries). These kinds of recommendations do not use any information about ratings, purchases, or any other user information. For example, let’s suppose we know that a user is viewing a particular camera or a particular blues musician. We can generate recommendations by examining the item’s (camera’s or musician’s) properties and the user’s stated interests. For example, a database could help generate recommendations by selecting lenses compatible with the camera or musicians in the same genre or a genre that the user has selected in their profile. In a similar context, items can be found by examining the items’ descriptions and finding close matches with the item the user is viewing. These are all a kind of content-based recommendation.
Collaborative: Collaborative filtering uses feedback from other users to help determine the recommendation for this user. Other users may contribute ratings, “likes,” purchases, views, and so on. Sometimes, websites, such as Amazon, will include a phrase like, “Customers who bought this item also bought…” Such a phrase is a clear indication of collaborative filtering. In practice, collaborative filtering is a means for predicting how much the user in question will like each item, and then filtering down to the few items with the highest-scoring predictions.

Now that we have got an idea of what content-based recommendation is, let’s go though it’s implementation.

Implementation

Suppose we wish to find similar items by their titles and descriptions. In other words, we want to examine the words used in each item to find items with similar words. We will represent each item as a vector and compare them with a distance metric to see how similar they are, where a smaller distance means they are more similar.

We can use the bag of words technique to convert an item’s title and description into a vector of numbers. This approach is common for any situation where text needs to be converted to a vector. Furthermore, each item’s vector will have the same dimension (same number of values), so we can easily compute the distance metric on any two item vectors.

The bag of words technique constructs a vector for each item that has as many values as there are unique words among all the items. If there are, say, 1000 unique words mentioned in the titles and descriptions of 100 items, then each of the 100 items will be represented by a 1000-dimension vector. The values in the vector are the counts of the number of times an item uses each particular word. If we have an item vector that starts <3, 0, 2, …>, and the 1000 unique words are “aardvark, aback, abandoned, …”, then we know the item uses the word aardvark 3 times, the word aback 0 times, the word abandoned 2 times, and so on. Also, we often eliminate “stop words,” or common words in the English language, such as “and,” “the,” or “get,” that have little meaning.

Given two item vectors, we can compute their distance in multiple ways. One common way is Euclidean distance:

where x_iand y_irefer to each value from the first and second items’ vectors. Euclidean distance is less accurate if the item titles and descriptions have a dramatically different number of words, so we often use cosine similarity instead. This metric measures the angle between the vectors. This is easy to understand if our vectors have two dimensions, but it works equally well in any number of dimensions. In two dimensions, the angle between two item vectors is the angle between the lines that connect the 0,0 and the item vector values, . Cosine similarity is calculated as

ai blueprint

where x and y are n-dimensional vectors and ‖x‖ and ‖y‖ refer to the “magnitude” of a vector, that is, its distance from the origin,

ai blueprint

Unlike Euclidean distance, larger values are better with cosine similarity because a larger value indicates the angle between the two vectors is smaller, so the vectors are closer or more similar to each other (recall that the graph of cosine starts at 1.0 with angle 0.0). Two identical vectors will have a cosine similarity of 1.0. The reason it is called the cosine similarity is because we can find the actual angle by taking the inverse cosine of d : Θ=d . We have no reason to do so since d works just fine as a similarity value.

SEE ALSO: Know your history — Artificial intelligence

Now we have a way of representing each item’s title and description as a vector, and we can compute how similar two vectors are with cosine similarity. Unfortunately, we have a problem. Two items will be considered highly similar if they use many of the same words even if those particular words are very common. For example, if all video items in our store have the word “Video” and “[DVD]” at the end of their titles, then every video might be considered similar to every other. To resolve this problem, we want to penalize (reduce) the values in the item vectors that represent common words.

A popular way to penalize common words in a bag of words vectors is known as Term Frequency-Inverse Document Frequency (TF-IDF). We recompute each value by multiplying a weight that factors in the commonality of the word. There are multiple variations of this reweighting formula, but a common one works as follows. Each value x_i in the vector is changed to where N is the number of items (say, 100 total items) and F(x_i) gives the count of items (out of the 100) that contain the word x_i. A word that is common will have a smaller N/F(x_i) factor so its weighted value xˆ_iwill be smaller the original x_i. We use the log() function to ensure the multiplier does not get excessively large for uncommon words. It’s worth noting that N/F(x_i) ≥ 1, and in the case when a word is found in every item AI BLUEPRINT

the 1 + in front of the log() ensures the word is still counted by leaving x_i unchanged.

Now we have properly weighted item vectors and a similarity metric, the last task is to find similar items with this information. Let’s suppose we are given a query item; we want to find three similar items. These items should have the largest cosine similarity to the query item. This is known as a nearest neighbor search. If coded naively, the nearest neighbor search requires computing the similarity from the query item to every other item. A better approach is to use a very efficient library such as Facebook’s faiss library. faiss precomputes similarities and stores them in an efficient index. It can also use the GPU to compute these similarities in parallel and find nearest neighbors extremely quickly.

There is one last complication. The bag of words vectors, even with stop words removed, is very large, and it’s is not uncommon to have vectors with 10k to 50k values, given how many English words may be used in an item title or description. The faiss library does not work well with such large vectors. We can limit the number of words, or a number of “features,” with a parameter to the bag of words processor. However, this parameter keeps the most common words, which is not necessarily what we want; instead, we want to keep the most important words. We can reduce the size of the vectors to just 100 values using matrix factorization, specifically the singular-value decomposition (SVD).

Summary

In this article, we had a look at the different ways of recommending an item to a user, and learnt how to develop a content-based recommendation system to find similar items based on the items’ titles and descriptions.

AI Blueprints published by Packt and written by Dr. Joshua Eckroth gives you a working framework and the techniques to build your own successful AI business applications.

You can read a preview of the book here.

The post AI Blueprints: Implementing content-based recommendations using Python appeared first on JAXenter.

Source : JAXenter

AI Blueprints: Implementing content-based recommendations using Python

SEE ALSO: AI as smart services for everyone

Implementation

SEE ALSO: Know your history — Artificial intelligence

SEE ALSO: How AI will change software development processes

SEE ALSO: Where will automation and AI go in 2019?

SEE ALSO: Python’s growth comes from the enormous expansion of data science and machine learning

Summary

You may also like...

Random Post

Recent

AI Blueprints: Implementing content-based recommendations using Python

SEE ALSO: AI as smart services for everyone

Implementation

SEE ALSO: Know your history — Artificial intelligence

SEE ALSO: How AI will change software development processes

SEE ALSO: Where will automation and AI go in 2019?

SEE ALSO: Python’s growth comes from the enormous expansion of data science and machine learning

Summary

You may also like...

Daily Crunch: Hands-on with the Samsung Galaxy Fold

Saturday’s Best Deals: Ring Gear, Electric Insect Repellent, Nordstrom Anniversary Sale, And More

The Man Farthest Down: A Record of Observation and Study in Europe by Park and Washington

Random Post

Recent

Tags