Netflix Project

Analysing the Netflix dataset with data visualization and exploratory data analysis, and building a recommendation model using additional data from an IMDB dataset and a Good Books dataset.

Posted by Afdhal Afgani on February 21, 2021

Netflix Project is my own exercise project, using data from Kaggle. There are three datasets in this project: the Netflix dataset, the IMDB dataset, and the Good Books dataset. First, I analyse the Netflix dataset and use the IMDB dataset to add more information about the Netflix content (TV Shows and Movies); then I check whether any of the content originates from a book.

Project Intro/Objective

The purpose of this project is to analyse the Netflix dataset with the help of additional data, create data visualizations, perform exploratory data analysis, and build a recommendation model for Netflix content.

Project Library

  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn
  • Plotly
  • Collections
  • WordCloud
  • sklearn (scikit-learn)

Data Preparation

In this section, I want to get an overview of the Netflix dataset. First, I look at the head of the dataset.

Then I look at the dataset information, which tells me how many entries there are in each column and their data types. After that, I check each column for null data.

Data Cleansing and Exploratory Data Analysis

In this section, I explore the Netflix dataset, visualize the data, and perform data cleansing where needed.

Comparison of the Number between TV Shows and Movies on Netflix

In this section, I compare the number of TV Shows and Movies on Netflix. First, I split the Netflix content into TV Shows and Movies, then I analyse each group separately.

With the code above, the content is split into TV Shows and Movies. Then I use a countplot to compare the counts of TV Shows and Movies on Netflix.
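The split can be sketched as below with a toy stand-in for the Kaggle data; the column names (`title`, `type`) follow the Netflix dataset, and the real code would load the frame with `pd.read_csv`.

```python
import pandas as pd

# Toy stand-in for the Kaggle Netflix data.
netflix_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Split the catalogue into two frames, one per content type.
netflix_movies = netflix_df[netflix_df["type"] == "Movie"]
netflix_shows = netflix_df[netflix_df["type"] == "TV Show"]

print(len(netflix_movies), len(netflix_shows))  # 3 2
# sns.countplot(x="type", data=netflix_df) then draws the comparison.
```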

As you can see from the countplot above, there are more Movies than TV Shows on Netflix.

Date of Release

In this section, I want to know the release dates of TV Shows and Movies, and analyse whether there are periods when more content was released.

TV Shows

First, I create a data frame containing release-date information: the year, the month, and the number of TV Shows released at that time. I drop the rows with missing dates, then extract the year and month from netflix_shows.

Then, with groupby, I build the data frame using the extracted years as columns and the months as row labels, use value_counts() to count the TV Shows released on each date, and fill the empty cells with 0. The result is date_shows_df, which shows how many TV Shows were released in each month of each year.

To help with the analysis, I create a heatmap so the release pattern is much easier to see.
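A minimal sketch of building such a month-by-year table, using a toy frame in place of netflix_shows; the `date_added` string format follows the Kaggle dataset, and the grouped counts here stand in for the value_counts() step:

```python
import pandas as pd

# Hypothetical subset of netflix_shows.
netflix_shows = pd.DataFrame({
    "title": ["S1", "S2", "S3", "S4"],
    "date_added": ["January 5, 2020", "January 9, 2020", "March 2, 2019", None],
})

# Drop rows without a date, then extract year and month.
shows = netflix_shows.dropna(subset=["date_added"]).copy()
added = pd.to_datetime(shows["date_added"])
shows["year"] = added.dt.year
shows["month"] = added.dt.month_name()

# Count releases per (month, year) and fill missing combinations with 0.
date_shows_df = shows.groupby(["month", "year"]).size().unstack(fill_value=0)
print(date_shows_df)
# sns.heatmap(date_shows_df) then shows release density per month and year.
```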

Movies

As with the TV Shows above, I create a data frame containing the release dates of the movies. I use netflix_movies and extract the year and month.

Then, I build the data frame using the same method as for the TV Shows above.

Then I create a heatmap so the release pattern is easier to see.

If we compare these two heatmaps, we can see that in 2020 there were still many TV Show updates but fewer Movie updates, perhaps because movies require a much bigger budget than TV shows.

In the Netflix TV Shows content updates for 2020, we can see that comparatively much more content was released in January and December: January was before the pandemic began, and by December, after several months of ongoing pandemic, people had largely adapted to it. This pattern did not occur in the Netflix Movies content updates.

Rating Analysis

In this section, I want to know the TV rating of each content type, and find the most common rating for each.

From these two countplots, the largest count for both Movies and TV Shows is the 'TV-MA' rating, a TV Parental Guidelines rating designed for mature audiences only.

The second largest count for both Movies and TV Shows is 'TV-14', content that may be inappropriate for children younger than 14 years of age.

But for the third largest, we get different results for Movies and TV Shows. For Movies it is 'R', content for which children under 17 require an accompanying parent or adult guardian; for TV Shows it is 'TV-PG', content that contains some material that parents or guardians may find unsuitable for younger children.

Preparing Data from Netflix and IMDB dataset

Merge the Dataset then Perform Data Cleansing

First, we read the IMDb ratings.csv file and keep only the weighted_average_vote column.

Then, we read the IMDb movies.csv file and keep only the title, year, and genre columns.

We combine these two data frames into a new data frame called ratings.

Finally, we drop duplicate rows based on the subset of Title, Release Year, and IMDB Rating, to avoid duplicated Movie or TV Show information.
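The steps above can be sketched with in-memory stand-ins for the two IMDb files (the real code would use `pd.read_csv`); the column names `Title`, `Release Year`, and `Rating` are assumptions about the renamed output frame:

```python
import pandas as pd

# Stand-ins for "IMDb ratings.csv" and "IMDb movies.csv".
imdb_ratings = pd.DataFrame({"weighted_average_vote": [8.1, 7.5, 8.1]})
imdb_titles = pd.DataFrame({
    "title": ["Alpha", "Beta", "Alpha"],
    "year": [2001, 2005, 2001],
    "genre": ["Drama", "Comedy", "Drama"],
})

# The two files are row-aligned, so combine them column-wise.
ratings = pd.DataFrame({
    "Title": imdb_titles["title"],
    "Release Year": imdb_titles["year"],
    "Rating": imdb_ratings["weighted_average_vote"],
    "Genres": imdb_titles["genre"],
})

# Drop duplicate (title, year, rating) rows to avoid repeated entries.
ratings = ratings.drop_duplicates(subset=["Title", "Release Year", "Rating"])
print(len(ratings))  # 2
```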

Using .info(), I check how many entries there are and their data types.

After creating the ratings data frame (the IMDB dataset), I merge it with netflix_df (the Netflix dataset) to create a data frame that contains the Netflix content and its IMDb information; we call it joint_netflix_imdb.

But we get some dirty data in joint_netflix_imdb: there are duplicate columns, and columns with the same name that contain different information. We need to clean this up.

To clean the data, we first drop the duplicate columns that contain the same information, so we drop the title and Release Year columns.

Then we change all column names to lowercase and rename rating to tv_rating.

We store the result in netflix_imdb_df.
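The merge-and-clean sequence might look like the sketch below, again on toy frames; the exact column names in the real datasets may differ, and the rename is done before lowercasing so the IMDb `Rating` and the TV `rating` columns cannot collide:

```python
import pandas as pd

# Hypothetical stand-ins for the ratings and Netflix frames.
ratings = pd.DataFrame({
    "Title": ["alpha", "beta"],
    "Release Year": [2001, 2005],
    "Rating": [8.1, 7.5],
})
netflix_df = pd.DataFrame({
    "title": ["alpha", "gamma"],
    "release_year": [2001, 2010],
    "rating": ["TV-MA", "TV-14"],  # a TV rating, not an IMDb score
})

# Inner join on title: only Netflix content with IMDb info survives.
joint_netflix_imdb = ratings.merge(
    netflix_df, left_on="Title", right_on="title", how="inner"
)

# Drop the duplicated key columns, rename the clashing 'rating' column
# to 'tv_rating', then lowercase everything.
netflix_imdb_df = joint_netflix_imdb.drop(columns=["title", "Release Year"])
netflix_imdb_df = netflix_imdb_df.rename(columns={"rating": "tv_rating"})
netflix_imdb_df.columns = [c.lower() for c in netflix_imdb_df.columns]
print(list(netflix_imdb_df.columns))
```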

Analysis of Content on Netflix

Top Rated Content on Netflix

Using px.sunburst, we create a sunburst chart of the top rated content and its country of origin.

We can see that Breakout has no country information because in our dataset (netflix_df) its country field is missing (NaN).

Countries with Highest Rated Content

First, we create a new data frame containing each country and its number of titles.

Then, using that data frame, we create a funnel plot. We can see that the United States and India are the countries with the most content on Netflix.

Year Wise Analysis

Using seaborn, we create a countplot. From that plot, we can see that 2018 is the year with the most released content.

Top 10 Movies Creating Countries

In this section, I want to know which country has created the most movies on Netflix. First, we extract the country information from netflix_movies.

After that, we count the occurrences of each country.

After that, we plot the data using a seaborn barplot. From that plot we can see that the United States has created more movies than any other country.
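Counting countries can be sketched as follows; note that in the real dataset a single title's `country` field may list several comma-separated countries, which is an assumption this sketch handles explicitly:

```python
import pandas as pd
from collections import Counter

# Toy netflix_movies frame.
netflix_movies = pd.DataFrame({
    "title": ["M1", "M2", "M3", "M4"],
    "country": ["United States", "India", "United States, India", None],
})

# Split multi-country entries and count each country once per title.
country_counts = Counter()
for entry in netflix_movies["country"].dropna():
    for country in entry.split(","):
        country_counts[country.strip()] += 1

top10 = country_counts.most_common(10)
print(top10)
# sns.barplot(x=[c for c, _ in top10], y=[n for _, n in top10]) draws the chart.
```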

Duration of Movies

In this section, I want to know the length of each movie and find the typical duration of a movie. We extract the duration from netflix_movies and clean the data so we keep only the minutes, which makes plotting easier.

We plot it using a KDE plot to see the distribution of movie durations. From the KDE plot above, we can see that most movies on Netflix last between 75 and 120 minutes.
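The cleaning step can be sketched like this; the `"NN min"` string format follows the Netflix dataset's duration column:

```python
import pandas as pd

# Toy netflix_movies frame with the dataset's "NN min" duration format.
netflix_movies = pd.DataFrame({
    "title": ["M1", "M2", "M3"],
    "duration": ["90 min", "120 min", "75 min"],
})

# Strip the ' min' suffix and convert to integers for plotting.
minutes = (
    netflix_movies["duration"].str.replace(" min", "", regex=False).astype(int)
)
print(minutes.tolist())  # [90, 120, 75]
# sns.kdeplot(minutes) then shows the distribution of movie lengths.
```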

WordCloud for Movies Genres

I want to create a WordCloud for the movie genres; a WordCloud makes it easy to see all of the movie genres in one image. First, using the collections module, we count the genres in netflix_movies.

After that, we plot the WordCloud using the wordcloud module. The WordCloud shows all the genres in a single image: the bigger the count, the bigger the word.
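The genre counting can be sketched with `collections.Counter`; the comma-separated genre strings here stand in for the dataset's `listed_in` column (an assumed column name):

```python
from collections import Counter

# Toy genre lists standing in for netflix_movies['listed_in'].
listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Dramas",
]

# Count every genre across all movies.
genre_count = Counter()
for genres in listed_in:
    genre_count.update(g.strip() for g in genres.split(","))

print(genre_count.most_common(3))
# WordCloud().generate_from_frequencies(genre_count) renders the image,
# drawing more frequent genres larger.
```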

Lollipop Plot for Movies Genres

I create a lollipop plot to see the counts of the movie genres. A lollipop plot is convenient here because our genre counts come from the collections module.

From the lollipop plot above, we can see that the top three genres with the most content on Netflix are International Movies, Dramas, and Comedies.

Most TV Shows Creating Country

In this section, I want to know which country has created the most TV shows on Netflix. First, we extract the country information from netflix_shows.

After that, we count the occurrences of each country.

After that, we plot the data using a seaborn barplot. From that plot we can see that the United States is the top TV show creating country on Netflix.

TV Shows with Most Seasons

In this section, I want to know which TV Shows have the most seasons on Netflix. Because the netflix_shows column that contains the season information is not numeric, we have to convert it. First we create a dummy data frame called features that contains title and duration, filled from netflix_shows. After that, we replace the string part of the value with an empty string.

Finally, we convert that string data type to an integer.

With the seasons as integers, we sort the values to get the top 20 TV Shows with the most seasons.
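The conversion can be sketched as below; the `"N Seasons"` strings follow the dataset's duration column for TV shows, and `" Seasons"` is stripped before `" Season"` so plurals are handled correctly:

```python
import pandas as pd

# Toy netflix_shows frame with season strings in the duration column.
netflix_shows = pd.DataFrame({
    "title": ["Grey's Anatomy", "NCIS", "Mini Show"],
    "duration": ["16 Seasons", "15 Seasons", "1 Season"],
})

# Copy the columns we need, strip the text, and cast to int.
features = netflix_shows[["title", "duration"]].copy()
features["duration"] = (
    features["duration"]
    .str.replace(" Seasons", "", regex=False)
    .str.replace(" Season", "", regex=False)
    .astype(int)
)

# Sort to get the shows with the most seasons on top.
top_shows = features.sort_values("duration", ascending=False)
print(top_shows.head(20))
```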

Finally, we create a barplot using seaborn. Grey's Anatomy is the TV Show with the most seasons, at 16; there are also two other TV Shows with at least 15 seasons: NCIS and Supernatural.

TV Shows Season Analysis

After seeing the top 20 TV Shows with the most seasons, I want to know the distribution of all season counts, so I create a new data frame.

After that, I create a pie chart. From that pie chart, we can see that 66.7% of the TV Shows on Netflix have only 1 season, and the majority of TV Shows on Netflix have 1 to 3 seasons.

WordCloud for TV Shows Genres

In this section, I want to create a WordCloud for the TV show genres. First, using the collections module, we count the genres in netflix_shows.

After that, we plot the WordCloud using wordcloud module.

TV Shows in United States

In this section, I want to see the oldest and the latest TV Shows in the United States. First, I need to create variables for the oldest and the latest TV shows.

Using go.Figure, I create a table of the oldest and latest TV shows in the United States. From that table, we can see that the oldest TV Show in the United States is from 1946: Pioneers of African-American Cinema.

Content in Indonesia

In this section, I want to analyse the content in Indonesia. I take a director-based approach, analysing the content together with its director.

From the figure above, we can see that content in Indonesia is split between popular and less popular directors. Riri Riza and Rocky Soraya are the most popular directors, each with 5 films to their name, while the majority of Indonesian directors have 1 title to their name. Sorting the content by year, we see that Indonesian content on Netflix is all recent, with the oldest title from 2019.

Recommendation System

Using TfidfVectorizer

In this section, I use TF-IDF for the recommendation system. The Term Frequency-Inverse Document Frequency (TF-IDF) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews, and therefore their significance in the final similarity score.

First, I import TfidfVectorizer from scikit-learn (sklearn). I configure it with stop_words set to english, because all of the data is in English. Then I replace NaN with an empty string and fit and transform netflix_df['description'], because we will use the description to find films with similar plots. Lastly, we check the shape of tfidf_matrix: it contains 17095 words across 7787 titles.

After getting the tfidf_matrix, I compute cosine_sim and set up the get_recommendation function. For cosine_sim, I import linear_kernel from sklearn and compute the cosine similarity matrix. Then I create an indices variable that will be used as the title index.

In the get_recommendation function, we use indices to look up the title's index, get the pairwise similarity scores against all movies, and sort them by similarity. Because I want the recommendation system to show only 10 movies, I take the top 10 most similar movies, then get their indices and return the movie names.
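The whole pipeline can be sketched on a three-title toy catalogue; the titles and descriptions here are invented, and only the structure (TfidfVectorizer, linear_kernel, an indices Series, a get_recommendation function) mirrors the description above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue; the real code uses netflix_df['description'].
netflix_df = pd.DataFrame({
    "title": ["Space Saga", "Ocean Tale", "Star Quest"],
    "description": [
        "astronauts explore deep space and distant stars",
        "a fisherman battles the ocean and a giant fish",
        "a crew travels through space to find new stars",
    ],
})

# Vectorize descriptions with English stop words removed.
tfidf = TfidfVectorizer(stop_words="english")
netflix_df["description"] = netflix_df["description"].fillna("")
tfidf_matrix = tfidf.fit_transform(netflix_df["description"])

# Cosine similarity between every pair of descriptions.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map titles to row positions for lookups.
indices = pd.Series(netflix_df.index, index=netflix_df["title"])

def get_recommendation(title, cosine_sim=cosine_sim, n=10):
    idx = indices[title]
    # Score every title against the query, drop the query itself, keep top n.
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    sim_scores = [s for s in sim_scores if s[0] != idx][:n]
    return netflix_df["title"].iloc[[i for i, _ in sim_scores]]

print(get_recommendation("Space Saga").tolist())
```

On this toy data "Star Quest" ranks first for "Space Saga", since their descriptions share the space/stars vocabulary.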

After creating the recommendation system with TfidfVectorizer, we try it out. From the two examples below, we see that our recommendations are not good enough, so we need to improve the model.

Content Based Filtering on Multiple Metrics

I add more factors besides the description:

  • Title
  • Cast
  • Director
  • Listed in
  • Plot

Cleaning Data

Before using the data frame in a new model, I have to clean netflix_df again, because we added new factors. First, I fill the NaN values with empty strings.

Then, I write a new function called clean_data that converts all the words to lowercase.

Next, I collect all the factors and store them in a variable called fillna.

Then, I apply the clean_data function to fillna.

With all the words in fillna lowercased, I create the bag of words: I write a new function called create_soup and apply it to fillna.
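A minimal sketch of the cleaning and "soup" steps on a one-row toy frame; the factor column names (`cast`, `director`, `listed_in`, `description`) are assumptions based on the Netflix dataset:

```python
import pandas as pd

# Toy frame with the factors used by the multi-metric model.
netflix_df = pd.DataFrame({
    "title": ["Space Saga"],
    "cast": ["Jane Doe, John Roe"],
    "director": ["A. Director"],
    "listed_in": ["Sci-Fi, Dramas"],
    "description": ["Astronauts explore deep space."],
})

features = ["title", "cast", "director", "listed_in", "description"]
netflix_df[features] = netflix_df[features].fillna("")

def clean_data(x):
    # Lowercase every string so 'Drama' and 'drama' match.
    return str.lower(x)

for feature in features:
    netflix_df[feature] = netflix_df[feature].apply(clean_data)

def create_soup(row):
    # Concatenate all factors into one 'bag of words' string.
    return " ".join(row[feature] for feature in features)

netflix_df["soup"] = netflix_df.apply(create_soup, axis=1)
print(netflix_df["soup"].iloc[0])
```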

Using CountVectorizer

In this section, I use CountVectorizer for the recommendation system. CountVectorizer transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and want to convert the words of each text into vectors for further text analysis.

The reason I use CountVectorizer, even though TfidfVectorizer generally works better, is that we have added more factors and I want fast results: CountVectorizer only counts the number of times a word appears in a document, while TfidfVectorizer also considers the overall document weight of each word.

First, I import CountVectorizer and cosine_similarity from sklearn. Then I use english as stop_words for CountVectorizer, fit_transform the fillna (bag of words) data, and finally compute the cosine similarity matrix.

For the index, I reset the index and create a new index for the movies.

The get_recommendation function works the same as before: we use indices to look up the title's index, get the pairwise similarity scores against all movies, sort them by similarity, take the top 10 most similar movies, and return the movie names.

After creating the recommendation system with CountVectorizer, we try it out. The difference from the TfidfVectorizer version is that this get_recommendation also takes cosine_sim2, the new cosine similarity matrix, as input.
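The count-based variant can be sketched as below, on invented "soup" strings (title + cast + director + genres + plot, lowercased); only the structure mirrors the steps above:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy 'soup' strings standing in for the bag-of-words column.
netflix_df = pd.DataFrame({
    "title": ["Space Saga", "Ocean Tale", "Star Quest"],
    "soup": [
        "space saga jane doe sci-fi astronauts explore space stars",
        "ocean tale john roe drama fisherman battles the ocean",
        "star quest jane doe sci-fi crew travels space stars",
    ],
})

# Count word occurrences instead of TF-IDF weights.
count = CountVectorizer(stop_words="english")
count_matrix = count.fit_transform(netflix_df["soup"])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

# Rebuild the index after any row filtering.
netflix_df = netflix_df.reset_index(drop=True)
indices = pd.Series(netflix_df.index, index=netflix_df["title"])

def get_recommendation(title, cosine_sim, n=10):
    idx = indices[title]
    # Score, drop the query itself, keep the top n most similar titles.
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    sim_scores = [s for s in sim_scores if s[0] != idx][:n]
    return netflix_df["title"].iloc[[i for i, _ in sim_scores]]

print(get_recommendation("Space Saga", cosine_sim2).tolist())
```

Here the shared cast, genre, and plot vocabulary puts "Star Quest" first for "Space Saga", illustrating how the extra factors influence the similarity.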

Netflix Content from Books?

In this section, I want to know whether any content on Netflix originates from books. First I load the books.csv data and call it books_df.

Then, I merge books_df with netflix_df and call the new data frame netflix_books. Before merging, I convert the titles in netflix_df and books_df to lowercase to make sure the merge is not affected by letter case.

Finally, I create a pie chart from netflix_books using go.Pie. From the pie chart below, we can see that the majority of content on Netflix is not from a book: Shows From Books is 3.58% and Shows Not From Books is 96.4%.
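The case-insensitive merge and the share calculation can be sketched with toy frames; the titles are invented, and `indicator=True` is one way (an assumption, not necessarily the original code) to flag which Netflix titles also appear as books:

```python
import pandas as pd

# Toy stand-ins for netflix_df and books_df (from books.csv).
netflix_df = pd.DataFrame({"title": ["The Witcher", "Breakout", "Dark Waters"]})
books_df = pd.DataFrame({"title": ["The Witcher", "Moby Dick"]})

# Lowercase titles on both sides so the join is case-insensitive.
netflix_df["title"] = netflix_df["title"].str.lower()
books_df["title"] = books_df["title"].str.lower()

# Left-merge and flag which Netflix titles also appear as books.
netflix_books = netflix_df.merge(books_df, on="title", how="left", indicator=True)
from_books = (netflix_books["_merge"] == "both").sum()
share = from_books / len(netflix_books) * 100
print(f"{share:.2f}% of titles match a book")
# go.Pie can then chart the from-books vs not-from-books shares.
```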

Additional Resources