Advice for a New Movie Studio

In my data science course with Flatiron School, our recent project was to pretend that Microsoft was launching a new movie studio, and we students had to make recommendations to the executive team. The project exercised our Python skills and facility with several libraries, including Pandas and visualization libraries. Data for our analysis came from sources such as the Internet Movie Database (IMDB), The Movie Database (TMDB), Box Office Mojo, and Rotten Tomatoes.

My initial survey of the sources found tons of data on individual movies: actors, directors, release dates, and run times. I found ratings from users and numbers of votes cast. The dataframes included rows in Cyrillic about Russian titles and how American blockbusters fared financially in Brazil.

In my role of consultant to this new studio, I decided to focus my analysis on high-level recommendations about genres in the United States market. I figured once the studio understood what sorts of films it generally wanted to make, it could make more granular decisions after. Meanwhile, my project partner focused on financial considerations.

In the end, we recommended that studio look to create movies in the action, adventure, fantasy, and comedy genres and to expect to spend between $100 and $200 million per film to achieve the highest likely returns on investment. (The associated presentation can be found here). While the conclusions we found were interesting, what I enjoyed most about the project was manipulating the data to unearth new insights.

Fan Enthusiasm

One metric I engineered for this project, I called “fan enthusiasm.” It checks the number of user-submitted votes for films in a genre against the total number films made in that genre. So even if the votes aren’t necessary positive for a particular film, users are more inclined to cast votes for films in some genres as opposed to others. In other words, certain genres have enthusiastic fans.

Here was my process for developing this metric:

First, I uploaded the csv files. One file had general info about the movies:

Another concerned their ratings and votes:

And another their regions:

I merged and filtered these dataframes into a single one called us_rated that contained American movies’ titles, genres, number of votes, and average rating. I filtered out duplicates and movies that had fewer than 100 votes, because I didn’t want little seen movies in my data. I ended up with a dataframe that looked like this:

As you can see, movies typically have more than one genre. I wrote the following function to split up them up:

Then I created a new dataframe from us_rated, called genre_pop, using the function and a nested for loop:

The head of my new dataframe looked like this:

As you can see, films are included multiple times, one row for each of their genres. That’s OK because for fan enthusiasm, I’m not interested in individual titles. So then I grouped the dataframe by genre with genre_pop.groupby("genre").mean() to end up with a dataframe head that looks like this:

(Note that I used averagerating for a different path of inquiry for the project, but it didn’t matter for fan enthusiasm.)

I made a dictionary of how many movies were made of each genre:

num_movies = genre_pop.genre.value_counts().to_dict(){'Drama': 5447,
'Comedy': 3081,
'Thriller': 2404,
'Horror': 2229,
'Documentary': 1860,
'Action': 1775,
'Romance': 1271,
'Crime': 1237,
'Adventure': 1037,
'Mystery': 826,
'Sci-Fi': 743,
'Biography': 693,
'Family': 559,
'Fantasy': 508,
'History': 398,
'Animation': 356,
'Music': 342,
'Sport': 240,
'War': 144,
'Western': 91,
'Musical': 89,
'News': 69,
'Game-Show': 1,
'Reality-TV': 1,
'Adult': 1}

My new dataframe, I called ranked_genres and I added the above dictionary to it.

Here’s the new dataframe:

With the code below, I created my new metric enthusiasm_score by dividing the number of votes per genre by the number of films in the genre. So the scores would be more readable, I divided that by 1,000. Then I created my final dataframe called enthusiasm and filtered out genres with few votes.

Here’s the final table:

Now we can recommend to Microsoft that it should consider making movies in genres that people are enthusiastic about. These top five are:

  • Adventure
  • Action
  • Comedy
  • Crime
  • Biography

Last, I used Seaborn to make a visualization of the dataframe.

The project entailed more merging and filtering of dataframes to come up with new insights and creating charts to visualize them. While I doubt Bill Gates will use my recommendations, for me it was worthwhile developing them.




Open to data-related opportunities (data analyst, data scientist, etc.)

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

How to Build and Run your Entire End-to-end ML Life-cycle with Scalable Components

Kimball’s and Inmon’s Data warehousing approaches

Quick Tutorial on Numpy

Week of January 24th, 2021:

Hands typing on a laptop.

Comparing Ranked List

Whiskey price model

Udacity’s Data Visualization Nanodegree: Capstone Project

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Henry Alpert

Henry Alpert

Open to data-related opportunities (data analyst, data scientist, etc.)

More from Medium

tqdm in While loops

Basics Of NumPy


Higher Order Functions and Closures in Python

Python While — Loop