The Limits of Rating Data in Machine Learning

By Nikhil Abraham

Rating data has its limitations in machine learning. For recommender systems to work well, they need to know about you as well as other people, both like you and different from you. Acquiring rating data allows a recommender system to learn from the experiences of multiple customers. Rating data could derive from a judgment (such as rating a product using stars or numbers) or a fact (a binary 1/0 that simply states that you bought the product, saw a movie, or stopped browsing at a certain web page).
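To make the judgment-versus-fact distinction concrete, here is a minimal sketch in base R. The matrix `stars`, its user and movie names, and every value in it are made up for illustration; they do not come from any real dataset.

```r
# Hypothetical star ratings: rows are users, columns are movies, NA = not rated.
stars <- matrix(
  c( 5, NA,  3,
    NA,  4, NA,
     2,  5, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("user1", "user2", "user3"),
                  c("movie_a", "movie_b", "movie_c"))
)

# The same data reduced to facts: 1 if the user rated (saw) the movie, else 0.
seen <- ifelse(is.na(stars), 0, 1)
print(seen)
```

The star matrix records a judgment, whereas the 1/0 matrix records only that an interaction took place.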

No matter the data source or type, rating data is always about behaviors. To rate a movie, you have to decide to see it, watch it, and then rate it based on your experience of seeing the movie. Actual recommender systems learn from rating data in different ways:

  • Collaborative filtering: Matches raters based on the movies or products they rated or used in the past. You can get recommendations based on items liked by people similar to you or on items similar to those you like.
  • Content-based filtering: Goes beyond the fact that you watched a movie. It examines the features of both you and the movie to determine whether a match exists based on the larger categories that the features represent. For instance, if you are a woman who likes action movies, the recommender will look for suggestions that fall in the intersection of these two categories.
  • Knowledge-based recommendations: Based on metadata, such as preferences expressed by users and product descriptions. It relies on machine learning and is effective when you do not have enough behavioral data to determine user or product characteristics. This lack of behavioral data is called a cold start, and it represents one of the most difficult recommender tasks because you don’t have access to either collaborative filtering or content-based filtering.
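As a rough illustration of the "items similar to those you like" flavor of collaborative filtering, the sketch below scores items by how often they were bought together. Co-occurrence counting is a deliberately simple stand-in for a proper similarity measure, and the `purchases` matrix, the user names, and the `recommend` helper are all hypothetical.

```r
# Hypothetical binary purchase data: rows are users, columns are items.
# A 1 means the user bought the item; the values are made up for illustration.
purchases <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ann", "bob", "cat", "dan"),
                  c("A", "B", "C", "D"))
)

# Item-by-item co-occurrence: entry [i, j] counts users who bought both items.
co_occurrence <- crossprod(purchases)
diag(co_occurrence) <- 0  # an item should not recommend itself

# Score each item by its co-occurrence with the items the user already owns,
# then exclude the owned items and return the n best-scoring names.
recommend <- function(user, n = 1) {
  owned <- purchases[user, ]
  scores <- as.vector(co_occurrence %*% owned)
  names(scores) <- colnames(purchases)
  scores[owned == 1] <- -Inf
  names(sort(scores, decreasing = TRUE))[seq_len(n)]
}

top_pick <- recommend("ann")
print(top_pick)
```

Here ann owns items A and B, and item C is the one most often bought alongside them, so it becomes the suggestion. A production system would normalize these counts, which is where similarity measures come in.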

When using collaborative filtering, you need to calculate similarity. Besides the Euclidean, Manhattan, and Chebyshev distances, you can rely on cosine similarity, which the rest of this article discusses. Cosine similarity measures the cosine of the angle between two vectors, which may seem like a difficult concept to grasp but is just a way to measure angles in data spaces.
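For comparison, the three distances named above can be computed with base R's `dist()` function, which exposes Chebyshev distance under the method name "maximum". The two rating vectors here are invented for the example.

```r
# Two hypothetical users' ratings for the same three movies (made-up values).
ratings <- rbind(user_a = c(1, 2, 3),
                 user_b = c(4, 6, 3))

# dist() selects the measure through its method argument.
euclidean <- as.numeric(dist(ratings, method = "euclidean"))  # straight-line distance
manhattan <- as.numeric(dist(ratings, method = "manhattan"))  # sum of absolute differences
chebyshev <- as.numeric(dist(ratings, method = "maximum"))    # largest single difference

print(c(euclidean = euclidean, manhattan = manhattan, chebyshev = chebyshev))
```

The per-movie differences are 3, 4, and 0, so the Euclidean distance is 5, the Manhattan distance is 7, and the Chebyshev distance is 4.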

Imagine a space made of features and having two points. You can measure the distance between the points. For instance, you could use the Euclidean distance, which is a perfect choice when you have few dimensions, but which fails miserably when you have many dimensions because of the curse of dimensionality: as dimensions pile up, all points end up at nearly the same distance from one another.
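A small simulation can hint at why Euclidean distance loses its usefulness as dimensions grow. This is only a sketch on uniformly random points; the helper `relative_contrast`, the seed, and the choice of 100 points in 2 versus 1,000 dimensions are arbitrary assumptions made for the illustration.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Relative contrast: how much farther the farthest pair of points is
# than the nearest pair, measured with Euclidean distance.
relative_contrast <- function(n_points, n_dims) {
  points <- matrix(runif(n_points * n_dims), nrow = n_points)
  d <- as.numeric(dist(points))  # all pairwise Euclidean distances
  (max(d) - min(d)) / min(d)
}

contrast_low  <- relative_contrast(100, 2)
contrast_high <- relative_contrast(100, 1000)

print(c(two_dims = contrast_low, thousand_dims = contrast_high))
```

In two dimensions the nearest and farthest pairs differ enormously, while in a thousand dimensions every pair sits at almost the same distance, so "near" and "far" stop being informative.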

The idea behind the cosine distance is to use the angle created by the two points connected to the space origin (the point where all dimensions are zero) instead. If the points are near, the angle is narrow, no matter how many dimensions there are. If they are far away, the angle is quite large.
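The angle-based idea can be sketched in a few lines of base R. The function name `cosine_similarity` and the vectors `u`, `v`, and `w` are illustrative choices, not part of any package.

```r
# Cosine of the angle between two vectors, both drawn from the origin.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

u <- c(3, 4, 0)
v <- c(6, 8, 0)  # same direction as u, but twice as long
w <- c(0, 0, 5)  # perpendicular to u

sim_uv <- cosine_similarity(u, v)
sim_uw <- cosine_similarity(u, w)

# Convert each cosine back to an angle in degrees for readability.
angle_uv <- acos(sim_uv) * 180 / pi
angle_uw <- acos(sim_uw) * 180 / pi

print(c(sim_uv = sim_uv, sim_uw = sim_uw))
print(c(angle_uv = angle_uv, angle_uw = angle_uw))
```

Although u and v are 5 units apart in Euclidean terms, they point in the same direction, so the angle between them is zero and their cosine similarity is 1. The vectors u and w share no direction at all, giving a 90-degree angle and a similarity of 0.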

Cosine similarity turns this angle into a score that, for non-negative ratings, ranges from 0 (nothing in common) to 1 (same direction), so you can read it much like a percentage. It is quite effective in telling whether a user is similar to another or whether a film can be associated with another because the same users favor it. The following R example locates the movies most similar to movie 50, Star Wars, in the MovieLense data.

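# Assumes the recommenderlab package is installed; it provides
# the MovieLense data and the similarity() function.
library(recommenderlab)
data("MovieLense")

# Check which movie sits in column 50.
print(colnames(MovieLense[,50]))

[1] "Star Wars (1977)"

# Cosine similarity between movie 50 and every other movie.
similar_movies <- similarity(MovieLense[,50],
                             MovieLense[,-50],
                             method = "cosine",
                             which = "items")

# Keep only the titles whose similarity exceeds 0.70.
colnames(similar_movies)[which(similar_movies > 0.70)]

[1] "Toy Story (1995)"               "Empire Strikes Back, The (1980)"
[3] "Raiders of the Lost Ark (1981)" "Return of the Jedi (1983)"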