Recommender Systems
I enjoy movies, particularly film criticism and cinematography (you should check out the excellent YouTube series Every Frame A Painting for some thoughtful discussions on these topics). When looking for sample data to play with recommender systems, I compiled a list of movies I have seen, along with my personal 1-10 ratings, here. Some common approaches to recommender systems are:
Demographic-based Recommendation
This approach recommends movies based on the demographics (age, gender, race, etc…) of the user or the user’s friends. It has the benefit of not needing more detailed information from the user. But I’m not comfortable with scraping my friends’ social media feeds, so I won’t be using it.
Context-based Recommendation
This is where recommendations would be made based on the user’s current context (e.g. what webpages they view, what products they buy, etc…). It sounds like it could be very accurate, at least in the short term, but I also won’t be using it, for reasons similar to the above.
Collaborative Filtering
This is where I would look at people who like similar movies to me. The general assumption is that people who liked similar movies in the past will also like similar movies in the future. Collaborative filtering got attention in the wake of the Netflix Prize, where some competitors used matrix factorization on the matrix of reviews. Since not everyone will have seen every movie, this matrix can be very sparse, and users or movies with few or no ratings lead to a complication called the “cold start” problem.
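To make the matrix factorization idea concrete, here’s a toy sketch of my own (not anything from the actual Prize entries): approximate a small ratings matrix R as a product of low-rank user and item factors, fitting only on the observed entries while alternating between solving for the user factors and the item factors.

```python
import numpy as np

# Toy ratings matrix: rows = users, cols = movies, 0 = unseen (very sparse in practice)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

k, lam = 2, 0.1                      # number of latent factors, regularization
rng = np.random.default_rng(0)
U, V = rng.random((4, k)), rng.random((4, k))  # user and item factor matrices
seen = R > 0

for _ in range(20):
    for i in range(4):               # users: solve least squares with items fixed
        Vi = V[seen[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k), Vi.T @ R[i, seen[i]])
    for j in range(4):               # items: solve least squares with users fixed
        Uj = U[seen[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k), Uj.T @ R[seen[:, j], j])

print(np.round(U @ V.T, 1))          # predicted ratings, unseen cells filled in
```

The filled-in cells of U @ V.T are the predictions; a real system does this at scale, which is where Spark comes in below.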
For my exercise, MovieLens has two free datasets available for public use: a more recent, frequently updated set with ~100,000 ratings, and a larger archived set with ~27,000,000 ratings. First, I had to clean my data a bit. It turns out there are quite a few distinct movies sharing the same title (e.g. “Richard III”, or “Aladdin”), but for simplicity’s sake I simply excluded them.
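For the curious, the cleaning amounted to something like this (a minimal pandas sketch, assuming the standard MovieLens movies.csv layout where each title carries a year suffix; the regex and column name are mine):

```python
import pandas as pd

movies = pd.read_csv("movies.csv")   # MovieLens file: movieId, title, genres

# MovieLens titles embed the year, e.g. "Richard III (1995)"; strip it so that
# different movies sharing a name collide, then exclude every such duplicate
movies["bare_title"] = movies["title"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)
movies = movies[~movies["bare_title"].duplicated(keep=False)]
```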
Then, I ran my data through an Alternating Least Squares (ALS) function. This involved training a model on MovieLens, adding my movies, then generating predictions. The Spark implementation of ALS lets users deal with cold start by either allowing NaN values in the predictions, or dropping the rows that contain them.
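Here’s roughly what that looks like with spark.ml’s ALS. The file names, the stand-in user ID, and the hyperparameters are placeholder assumptions, not my exact settings (recommendForUserSubset needs Spark ≥ 2.3):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("movie-recs").getOrCreate()

# MovieLens ratings plus my own, which I gave an unused userId;
# my_ratings.csv is assumed to share the ratings.csv schema
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
mine = spark.read.csv("my_ratings.csv", header=True, inferSchema=True)

als = ALS(
    userCol="userId", itemCol="movieId", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1,
    coldStartStrategy="drop",        # the other option, "nan", keeps NaN predictions
)
model = als.fit(ratings.union(mine))

# Top 20 predictions for my stand-in userId
me = spark.createDataFrame([(999999,)], ["userId"])
model.recommendForUserSubset(me, 20).show(truncate=False)
```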
Some words about tools: Apache Spark
Since I’d be handling fairly large datasets, I looked into Apache Spark. Spark is written in Scala and its native API is also in Scala, which means you’re also tied to the JDK, for better and for worse. Most notably, Spark 2.x only runs on Java 1.8. Depending on your setup, this could be a minor inconvenience, or the better part of an afternoon spent hunting down unsupported community versions of the JDK and linking your tools. If you don’t want to use the native Scala API, there are interfaces for Python (PySpark) and R (SparkR and sparklyr). I went with PySpark.
Job management is done with a web GUI (default http://localhost:4040), where you can view the queue of current and past jobs. Performance tuning in Spark can be a bit more involved.
Spark’s basic data structure is the Resilient Distributed Dataset (RDD). You can work with RDDs directly, but for analyses it’s nicer to use Spark DataFrames. Spark DataFrames have similarities with, but are not entirely equivalent to, Pandas dataframes. For one, Spark DataFrames are immutable, since they’re based on RDDs, which are themselves immutable, while Pandas dataframes are mutable. If that’s not enough ways to work, Spark also has an SQL interface.
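Here’s a quick look at the three ways of working, on made-up toy data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Build a DataFrame from toy data
df = spark.createDataFrame(
    [(1, "Aladdin", 8.0), (2, "Richard III", 7.0), (3, "Metropolis", 9.0)],
    ["movieId", "title", "rating"],
)

# DataFrame transformations don't mutate df; they return a new DataFrame
high = df.filter(df.rating >= 8.0)
high.show()

# The same query through the SQL interface
df.createOrReplaceTempView("movies")
spark.sql("SELECT title FROM movies WHERE rating >= 8.0").show()

# And you can always drop down to the underlying RDD
print(df.rdd.map(lambda row: row.title).collect())
```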
Spark’s machine learning API is changing from spark.mllib (supporting RDDs) to spark.ml (supporting Spark DataFrames), so new code should use the latter, though apparently the former is not deprecated. You can also plug in your favorite deep learning frameworks (Keras, DeepLearning4J) to do deep learning.
Content-based filtering
This is where I would look for patterns in the movies I like, and make recommendations based on those patterns. Here we’re recommending movies with synopses similar to the ones the user likes. The general steps are:
- I obtained an API key from OMDBAPI, which allows 1000 queries/day. Until I figure out how to query multiple movies in the same request, that means a maximum of 1000 movies a day.
- Using the requests library, I fetched the metadata for the movies in my list, specifically the full text of the synopses. (I first tried pyCurl, which turned out to be more powerful but much more complicated than I needed.)
- Reduced each synopsis to keywords using the Rapid Automatic Keyword Extraction (RAKE) algorithm, via an implementation built on top of the Python Natural Language Toolkit (NLTK).
- Calculated similarity between the movies’ keywords. Cosine similarity is often used, but other metrics could also work.
- Created a recommendations() function that gives the most similar movies (shortest cosine distance) to the one the user supplies. A sketch of the whole pipeline follows this list.
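Putting those steps together, here’s a minimal sketch of the pipeline. The toy synopses, the placeholder API key, and the choice of TF-IDF to vectorize the keyword strings are illustrative assumptions rather than my exact code:

```python
import requests
from rake_nltk import Rake  # pip install rake-nltk; also nltk.download("stopwords") and "punkt" once
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

OMDB_KEY = "YOUR_KEY"  # placeholder; get your own key from OMDBAPI

def fetch_synopsis(title):
    """Fetch one movie's full plot text from OMDb (one title per request)."""
    resp = requests.get(
        "http://www.omdbapi.com/",
        params={"apikey": OMDB_KEY, "t": title, "plot": "full"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return data.get("Plot") if data.get("Response") == "True" else None

# Toy synopses standing in for real OMDb responses
synopses = {
    "Movie A": "A retired assassin seeks revenge on the gangsters who wronged him.",
    "Movie B": "An ex-hitman comes out of retirement to hunt down a crime syndicate.",
    "Movie C": "A young wizard attends a school of magic and befriends two classmates.",
}

# Reduce each synopsis to a bag of RAKE keyword phrases
rake = Rake()
keywords = {}
for title, text in synopses.items():
    rake.extract_keywords_from_text(text)
    keywords[title] = " ".join(rake.get_ranked_phrases())

# Vectorize the keyword strings and compute pairwise cosine similarity
titles = list(keywords)
vectors = TfidfVectorizer().fit_transform([keywords[t] for t in titles])
sim = cosine_similarity(vectors)

def recommendations(title, n=2):
    """Return the n movies whose keywords are most cosine-similar to title's."""
    i = titles.index(title)
    ranked = sorted(range(len(titles)), key=lambda j: sim[i][j], reverse=True)
    return [titles[j] for j in ranked if j != i][:n]

print(recommendations("Movie A"))  # Movie B should rank first
```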
Evaluating recommendations
Though accuracy is a traditional metric for recommendations, serendipity, or the ability of a recommendation to pleasantly surprise you, is also important, as eloquently explained in Eugene Yan’s blog post.
References
I find Practical Recommender Systems to be a good overview, going from concepts to implementation details like how to track user interactions. As the “practical” in the title suggests, there is a heavy emphasis on implementations at Netflix and Amazon, and the book comes with an accompanying project.