Machine Learning: Classifying Films

In the era of Netflix, there’s a wide array of films available to watch online. One of the features that made Netflix so successful was its ability to recommend new movies. This ability amounts to answering the following questions:

If a person likes a particular movie, what are some similar movies? If a person likes a given set of movies (and rates them accordingly), what is a good estimate of their rating of another movie?

One method for answering these questions is called K-Nearest-Neighbors, or KNN. The way this works, basically, is that we use a function that calculates the ‘distance’ between two movies. Distance is in quotes because it’s not entirely clear how to define it, and in fact, there are multiple methods. Our method was to compare how similar the ratings for those two movies were across all users. So if movie1 was rated 3 by user1 and 4 by user2, and movie2 was rated 2 by user1 and 4 by user2, then the distance between movie1 and movie2 would be sqrt((3 - 2)^2 + (4 - 4)^2) = 1.
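
To make the arithmetic concrete, here is a minimal Python sketch of that kind of distance function. It is not the actual homework code: the name movie_distance, the dict-of-ratings layout, and the choice to compare only users who rated both movies (returning infinity when there are none) are all assumptions made for illustration.

```python
import math


def movie_distance(ratings_a, ratings_b):
    """Euclidean distance between two movies' ratings.

    Each argument maps a user id to that user's rating of the movie.
    Assumption: only users who rated BOTH movies are compared.
    """
    shared_users = set(ratings_a) & set(ratings_b)
    if not shared_users:
        # Assumption: with no users in common, treat the movies as
        # infinitely far apart.
        return float("inf")
    squared_diffs = (
        (ratings_a[user] - ratings_b[user]) ** 2 for user in shared_users
    )
    return math.sqrt(sum(squared_diffs))


# The worked example from the paragraph above.
movie1 = {"user1": 3, "user2": 4}
movie2 = {"user1": 2, "user2": 4}
print(movie_distance(movie1, movie2))  # 1.0
```

How to treat users who rated only one of the two movies is a real design choice. Skipping them, as this sketch does, is just one option.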

Once we have all these distances (from a given movie), we just return the handful of movies with the lowest distances as our recommendations!
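
The ranking step can be sketched in a few lines of Python as well. Again, this is an illustration under assumptions rather than the homework solution: nearest_movies, movie_distance, the parameter k, and the toy ratings (placeholder titles and made-up numbers) are all hypothetical, and the distance only looks at users who rated both movies.

```python
import heapq
import math


def movie_distance(ratings_a, ratings_b):
    """Euclidean distance over users who rated both movies (assumption)."""
    shared_users = set(ratings_a) & set(ratings_b)
    if not shared_users:
        return float("inf")
    return math.sqrt(
        sum((ratings_a[u] - ratings_b[u]) ** 2 for u in shared_users)
    )


def nearest_movies(target, all_ratings, k=3):
    """Return the k (distance, title) pairs closest to the target movie."""
    distances = (
        (movie_distance(all_ratings[target], ratings), title)
        for title, ratings in all_ratings.items()
        if title != target  # a movie is not its own recommendation
    )
    # nsmallest sorts by distance first, then by title to break ties.
    return heapq.nsmallest(k, distances)


# Made-up toy data: movie title -> {user id: rating}.
toy_ratings = {
    "movie_a": {"u1": 5, "u2": 3},
    "movie_b": {"u1": 4, "u2": 3},
    "movie_c": {"u1": 1, "u2": 1},
    "movie_d": {"u1": 5, "u2": 5},
}

for distance, title in nearest_movies("movie_a", toy_ratings, k=2):
    print(distance, title)
# 1.0 movie_b
# 2.0 movie_d
```

Using heapq.nsmallest avoids sorting the whole list when only a few recommendations are needed, though a plain sorted(...)[:k] would work just as well at this scale.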

For our homework assignment yesterday, we had to write a program that performed this sort of analysis. Here’s an example of the output:

Toy Story:

2.69947506562 Star Wars (1977)
2.85147058824 Return of the Jedi (1983)
3.01114649682 Independence Day (ID4) (1996)
3.1107266436 Rock, The (1996)
3.37230769231 Fargo (1996)
3.41573033708 Mission: Impossible (1996)
3.43060498221 Twelve Monkeys (1995)
3.46153846154 Willy Wonka and the Chocolate Factory (1971)
3.5 Star Trek: First Contact (1996)
3.60740740741 Jerry Maguire (1996)
3.72161172161 Raiders of the Lost Ark (1981)
4.171875 Men in Black (1997)
4.18067226891 Back to the Future (1985)
4.18571428571 Empire Strikes Back, The (1980)
4.2012987013 Twister (1996)