Advice for a New Movie Studio

In my data science course with Flatiron School, our recent project was to pretend that Microsoft was launching a new movie studio, and we students had to make recommendations to the executive team. The project exercised our Python skills and facility with several libraries, including Pandas and visualization libraries. Data for our analysis came from sources such as the Internet Movie Database (IMDB), The Movie Database (TMDB), Box Office Mojo, and Rotten Tomatoes.
My initial survey of the sources found tons of data on individual movies: actors, directors, release dates, and run times. I found ratings from users and numbers of votes cast. The dataframes included rows in Cyrillic about Russian titles and how American blockbusters fared financially in Brazil.
In my role of consultant to this new studio, I decided to focus my analysis on high-level recommendations about genres in the United States market. I figured once the studio understood what sorts of films it generally wanted to make, it could make more granular decisions after. Meanwhile, my project partner focused on financial considerations.
In the end, we recommended that studio look to create movies in the action, adventure, fantasy, and comedy genres and to expect to spend between $100 and $200 million per film to achieve the highest likely returns on investment. (The associated presentation can be found here). While the conclusions we found were interesting, what I enjoyed most about the project was manipulating the data to unearth new insights.
Fan Enthusiasm
One metric I engineered for this project, I called “fan enthusiasm.” It checks the number of user-submitted votes for films in a genre against the total number films made in that genre. So even if the votes aren’t necessary positive for a particular film, users are more inclined to cast votes for films in some genres as opposed to others. In other words, certain genres have enthusiastic fans.
Here was my process for developing this metric:
First, I uploaded the csv files. One file had general info about the movies:

Another concerned their ratings and votes:

And another their regions:

I merged and filtered these dataframes into a single one called us_rated
that contained American movies’ titles, genres, number of votes, and average rating. I filtered out duplicates and movies that had fewer than 100 votes, because I didn’t want little seen movies in my data. I ended up with a dataframe that looked like this:

As you can see, movies typically have more than one genre. I wrote the following function to split up them up:
Then I created a new dataframe from us_rated
, called genre_pop
, using the function and a nested for
loop:
The head of my new dataframe looked like this:

As you can see, films are included multiple times, one row for each of their genres. That’s OK because for fan enthusiasm, I’m not interested in individual titles. So then I grouped the dataframe by genre with genre_pop.groupby("genre").mean()
to end up with a dataframe head that looks like this:

(Note that I used averagerating
for a different path of inquiry for the project, but it didn’t matter for fan enthusiasm.)
I made a dictionary of how many movies were made of each genre:
num_movies = genre_pop.genre.value_counts().to_dict(){'Drama': 5447,
'Comedy': 3081,
'Thriller': 2404,
'Horror': 2229,
'Documentary': 1860,
'Action': 1775,
'Romance': 1271,
'Crime': 1237,
'Adventure': 1037,
'Mystery': 826,
'Sci-Fi': 743,
'Biography': 693,
'Family': 559,
'Fantasy': 508,
'History': 398,
'Animation': 356,
'Music': 342,
'Sport': 240,
'War': 144,
'Western': 91,
'Musical': 89,
'News': 69,
'Game-Show': 1,
'Reality-TV': 1,
'Adult': 1}
My new dataframe, I called ranked_genres
and I added the above dictionary to it.
Here’s the new dataframe:

With the code below, I created my new metric enthusiasm_score
by dividing the number of votes per genre by the number of films in the genre. So the scores would be more readable, I divided that by 1,000. Then I created my final dataframe called enthusiasm
and filtered out genres with few votes.
Here’s the final table:

Now we can recommend to Microsoft that it should consider making movies in genres that people are enthusiastic about. These top five are:
- Adventure
- Action
- Comedy
- Crime
- Biography
Last, I used Seaborn to make a visualization of the dataframe.

The project entailed more merging and filtering of dataframes to come up with new insights and creating charts to visualize them. While I doubt Bill Gates will use my recommendations, for me it was worthwhile developing them.