
# 19 - Recommender Systems

Recommender systems are a class of algorithms that recommend items to users. The most common example is recommending products to users based on their previous purchases and ratings.
The two most common types of recommender systems are:

## Content-based recommender systems

These systems are based on the attributes of the items: they recommend items similar to those a user has liked in the past.
Ex. Netflix recommends movies based on the genre of the movies you liked in the past.
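
The idea of "similar attributes" can be sketched with genre vectors and cosine similarity. This is only an illustration: the titles, the genre columns and the helper names `cosine_similarity_matrix` and `most_similar` are made up, and real systems use much richer item attributes.

```python
import numpy as np

# Hypothetical items described by three genre flags: action, comedy, sci-fi
titles = ["Movie A", "Movie B", "Movie C"]
genres = np.array([
    [1, 0, 1],  # Movie A: action, sci-fi
    [1, 1, 1],  # Movie B: action, comedy, sci-fi
    [0, 1, 0],  # Movie C: comedy
], dtype=float)

def cosine_similarity_matrix(features):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms
    return unit @ unit.T

similarity = cosine_similarity_matrix(genres)

def most_similar(title):
    """Return the other title whose genre vector is closest to `title`."""
    i = titles.index(title)
    scores = similarity[i].copy()
    scores[i] = -np.inf  # exclude the item itself
    return titles[int(np.argmax(scores))]

print(most_similar("Movie A"))
```

A user who liked Movie A would be pointed to Movie B here, because the two share the action and sci-fi flags, while Movie C shares none.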

## Collaborative filtering recommender systems

These systems use the "wisdom of the crowd" to recommend items.
The idea is that if user A buys the same items as user B, then A is likely to buy other items that B has bought but A has not yet bought.
Ex. Amazon recommends items based on other users' shopping experience.
Collaborative filtering is the most common type of recommender system because it often gives better results and is relatively easy to understand.
The algorithm can do feature learning on its own, which means it can learn for itself which features to use.
Collaborative filtering systems can be divided into 2 categories:

### Memory-based collaborative filtering

It uses user rating data to compute the similarity between users or items.
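
A minimal sketch of the memory-based idea, assuming a tiny made-up rating table: compute the Pearson correlation between users, pick the most similar user, and suggest what that user rated and the target user has not. The table `toy` and the helper `recommend_from_nearest_user` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are items, NaN = not rated
toy = pd.DataFrame(
    {
        "item_1": [5, 4, 1],
        "item_2": [4, 5, 2],
        "item_3": [1, 2, 5],
        "item_4": [np.nan, 5, 1],
    },
    index=["A", "B", "C"],
)

# User-user Pearson correlation, computed on the items both users rated
user_similarity = toy.T.corr()

def recommend_from_nearest_user(user):
    """Return the most similar user and their ratings for items `user` has not rated."""
    neighbours = user_similarity[user].drop(user)
    nearest = neighbours.idxmax()
    unseen = toy.columns[toy.loc[user].isna()]
    return nearest, toy.loc[nearest, unseen].sort_values(ascending=False)

nearest, suggestions = recommend_from_nearest_user("A")
print(nearest)
print(suggestions)
```

Nothing is trained here: the similarities are computed directly from the stored ratings, which is why this family is called memory-based.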

### Model-based collaborative filtering

It uses machine learning algorithms to predict users' rating of unrated items.
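
One common model-based family is matrix factorisation. The sketch below is a simplified stand-in, not a production method: it fills the missing cells with item means and keeps a low-rank SVD approximation. The matrix `R`, the function `svd_predict` and the choice `k=1` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical user x item ratings; np.nan marks an unrated item
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 1.0, np.nan],
    [1.0, np.nan, 5.0, 4.0],
    [np.nan, 1.0, 4.0, 5.0],
])

def svd_predict(ratings, k=1):
    """Fill unrated cells with a rank-k approximation of the rating matrix."""
    observed = ~np.isnan(ratings)
    # Start from each item's mean rating so the SVD has no missing values
    item_means = np.nanmean(ratings, axis=0)
    filled = np.where(observed, ratings, item_means)
    # Centre on the item means, factorise, keep the k strongest latent factors
    U, s, Vt = np.linalg.svd(filled - item_means, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    predicted = low_rank + item_means
    # Keep the known ratings and only replace the unrated cells
    return np.where(observed, ratings, predicted)

predictions = svd_predict(R, k=1)
print(predictions.round(2))
```

The first user likes the first two items and dislikes the last one, so the single latent factor predicts a below-average rating for the third item, which is liked by the users with the opposite taste.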

## Python implementation

Scikit-learn does not provide a built-in recommender system, but we can build a simple one ourselves. The example below uses pandas to recommend movies based on the correlation between their user ratings.

### Dataframe

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

column_names = ['user_id', 'item_id', 'rating', 'timestamp']

df = pd.read_csv('u.data', sep='\t', names=column_names)
# lookup table that maps item_id to movie title
movie_titles = pd.read_csv('Movie_Id_Titles')

# merge the two dataframes on item_id (the shared key)
df = pd.merge(df, movie_titles, on='item_id')
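
As a sanity check on the merge, the sketch below runs the same join on two made-up frames. The `how="left"` and `validate="many_to_one"` arguments are an optional addition, not part of the code above: they keep every rating row and raise an error if an `item_id` appears more than once in the title table.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings file and the title lookup table
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 10, 20],
    "rating": [5, 4, 3],
})
toy_titles = pd.DataFrame({
    "item_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# Many rating rows map to one title row per item_id
merged = pd.merge(
    toy_ratings, toy_titles, on="item_id", how="left", validate="many_to_one"
)
print(merged)
```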

### Data Analysis

Average rating for each movie:

df.groupby('title')['rating'].mean().sort_values(ascending=False).head()

| title | rating |
| --- | --- |
| They Made Me a Criminal (1939) | 5.0 |
| Marlene Dietrich: Shadow and Light (1996) | 5.0 |
| Saint of Fort Washington, The (1993) | 5.0 |
| Someone Else's America (1995) | 5.0 |
| Star Kid (1997) | 5.0 |


Number of ratings for each movie:

df.groupby('title')['rating'].count().sort_values(ascending=False).head()

| title | rating (count) |
| --- | --- |
| Star Wars (1977) | 584 |
| Contact (1997) | 509 |
| Fargo (1996) | 508 |
| Return of the Jedi (1983) | 507 |
| Liar Liar (1997) | 485 |


### Recommender System

Create a dataframe with the average rating and the number of ratings for each movie:

ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()


Add a column with the number of ratings each movie received, next to its average rating:

ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()
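
The same two columns can also be built in a single step with `agg`. The sketch below uses a small made-up frame `toy_df` in place of the merged `df`, and `ratings_summary` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for the merged `df` used above
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie B"],
    "rating": [5, 3, 4, 4, 1],
})

# Mean and count per title, renamed to match the column names used above
ratings_summary = (
    toy_df.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num of ratings"})
)
print(ratings_summary)
```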


#### Plotting

ratings['num of ratings'].hist(bins=70)
ratings['rating'].hist(bins=70)


Relation between the average rating and the number of ratings:

sns.jointplot(x='rating', y='num of ratings', data=ratings, alpha=0.5)


#### Recommending similar movies
Create a matrix with the user_id as index and the title as columns:

X_movie = df.pivot_table(index='user_id', columns='title', values='rating')

star_wars_user_ratings = X_movie['Star Wars (1977)']
star_wars_user_ratings.head()

liar_liar_user_ratings = X_movie['Liar Liar (1997)']
liar_liar_user_ratings.head()
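
To make the shape of this matrix concrete, here is a small sketch on made-up data. `toy_long`, `toy_wide` and `sparsity` are hypothetical names. Each row is a user, each column a title, and a movie the user did not rate shows up as NaN.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating
toy_long = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [5, 3, 4, 2],
})

# Wide user x title matrix; missing combinations become NaN
toy_wide = toy_long.pivot_table(index="user_id", columns="title", values="rating")

# Share of empty cells: real rating matrices are mostly NaN
sparsity = toy_wide.isna().to_numpy().mean()
print(toy_wide)
print(sparsity)
```

Note that `pivot_table` averages the values by default, so a user who rated the same title twice would contribute the mean of those ratings.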


Compute the pairwise correlation between each movie column of the matrix and the ratings Series of the reference movie:

similar_to_star_wars = X_movie.corrwith(star_wars_user_ratings)
similar_to_liar_liar = X_movie.corrwith(liar_liar_user_ratings)


Create a dataframe with the correlation of each movie with Star Wars, drop the NaN values, and sort by correlation:

corr_star_wars = pd.DataFrame(similar_to_star_wars, columns=['Correlation'])
corr_star_wars.dropna(inplace=True)
corr_star_wars.head()
corr_star_wars.sort_values('Correlation', ascending=False).head(30)

| title | Correlation |
| --- | --- |
| Hollow Reed (1996) | 1.000000 |
| Commandments (1997) | 1.000000 |
| Cosi (1996) | 1.000000 |
| No Escape (1994) | 1.000000 |
| Stripes (1981) | 1.000000 |
| Star Wars (1977) | 1.000000 |
| Man of the Year (1995) | 1.000000 |
| Beans of Egypt, Maine, The (1994) | 1.000000 |
| Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) | 1.000000 |
| Outlaw, The (1943) | 1.000000 |
| Line King: Al Hirschfeld, The (1996) | 1.000000 |
| Hurricane Streets (1998) | 1.000000 |
| Scarlet Letter, The (1926) | 1.000000 |
| Good Man in Africa, A (1994) | 1.000000 |
| Safe Passage (1994) | 1.000000 |
| Golden Earrings (1947) | 1.000000 |
| Full Speed (1996) | 1.000000 |
| Twisted (1996) | 1.000000 |
| Ed's Next Move (1996) | 1.000000 |
| Mondo (1996) | 1.000000 |
| Last Time I Saw Paris, The (1954) | 1.000000 |
| Maya Lin: A Strong Clear Vision (1994) | 1.000000 |
| Designated Mourner, The (1997) | 0.970725 |
| Albino Alligator (1996) | 0.968496 |
| Angel Baby (1995) | 0.962250 |
| Prisoner of the Mountains (Kavkazsky Plennik) (1996) | 0.927173 |
| Love in the Afternoon (1957) | 0.923381 |
| 'Til There Was You (1997) | 0.872872 |
| A Chef in Love (1996) | 0.868599 |
| Guantanamera (1994) | 0.866025 |

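The result is a list of movies that appear similar to Star Wars, but the ranking is not reliable: movies with very few ratings can reach a perfect correlation of 1.0 by chance, because only a handful of users rated both them and Star Wars.
The fix is to keep only the movies with a minimum number of ratings.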
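
To see why sparsely rated movies can reach a correlation of 1.0, here is a small sketch with made-up ratings. The movie names and the variables `toy_matrix`, `correlations` and `overlap` are hypothetical. The correlation is computed only on users who rated both movies, and with two shared users it can only be 1.0 or -1.0 (or undefined if one of them gave identical ratings).

```python
import numpy as np
import pandas as pd

# Hypothetical user x title matrix; NaN means the user did not rate the movie
toy_matrix = pd.DataFrame({
    "Popular Movie": [5, 4, 2, 1, 3],
    "Well Rated Too": [5, 5, 1, 2, 3],
    "Obscure Movie": [np.nan, np.nan, 2, 1, np.nan],
})

# Correlation of every column with the reference movie
correlations = toy_matrix.corrwith(toy_matrix["Popular Movie"])

# Number of users who rated both the movie and the reference movie
rated_reference = toy_matrix["Popular Movie"].notna()
overlap = toy_matrix.loc[rated_reference].count()

print(correlations)
print(overlap)
```

"Obscure Movie" ranks above "Well Rated Too" even though its score rests on only two users, which is exactly the effect seen in the Star Wars list.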

corr_liar_liar = pd.DataFrame(similar_to_liar_liar, columns=['Correlation'])
corr_liar_liar.dropna(inplace=True)
corr_liar_liar = corr_liar_liar.join(ratings['num of ratings'])
corr_liar_liar.head()
corr_liar_liar[corr_liar_liar['num of ratings'] > 100].sort_values('Correlation', ascending=False).head()

| title | Correlation | num of ratings |
| --- | --- | --- |
| Liar Liar (1997) | 1.000000 | 485 |
| Batman Forever (1995) | 0.516968 | 114 |
| Mask, The (1994) | 0.484650 | 129 |
| Down Periscope (1996) | 0.472681 | 101 |
| Con Air (1997) | 0.469828 | 137 |
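
The steps above can be wrapped into one helper. This is a sketch under assumptions: the function name `similar_movies` and its parameters are hypothetical, it counts the ratings directly from the matrix (which matches `num of ratings` as long as each user rated a title at most once), and it also drops the reference movie from its own results.

```python
import numpy as np
import pandas as pd

def similar_movies(user_item, title, min_ratings=100, top_n=5):
    """Rank movies by rating correlation with `title`.

    user_item: user x title matrix with NaN for unrated movies.
    min_ratings: keep only movies with more than this many ratings.
    """
    correlations = user_item.corrwith(user_item[title]).dropna()
    result = pd.DataFrame({
        "Correlation": correlations,
        "num of ratings": user_item.count().reindex(correlations.index),
    })
    result = result[result["num of ratings"] > min_ratings]
    # The reference movie always correlates perfectly with itself
    result = result.drop(index=title, errors="ignore")
    return result.sort_values("Correlation", ascending=False).head(top_n)

# Usage on a tiny made-up matrix, with the threshold lowered to fit its size
toy = pd.DataFrame({
    "Movie A": [5, 4, 3, 2, 1, 5],
    "Movie B": [4, 5, 3, 1, 2, 4],
    "Movie C": [np.nan, np.nan, np.nan, 2, 1, np.nan],
})
top = similar_movies(toy, "Movie A", min_ratings=3)
print(top)
```

With the matrix built earlier, a call such as `similar_movies(X_movie, 'Liar Liar (1997)')` would be the equivalent of the filtering shown above.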
