TMDb (The Movie Database) offers a powerful, easy-to-use API. New to the world of APIs? Read below for some tips on how to get started!
By now, the middle of quarantine, it is safe to say that I have watched the majority of movies on Netflix. So you can imagine that I was excited to discover that our first Data Science project would involve analyzing movies! And thus began my process of investigating the TMDb API.
Using tmdbsimple Wrapper
One great thing about the TMDb API is that there is a Python wrapper available, called “tmdbsimple”, which simplifies the use of the API. Follow the steps found here to install tmdbsimple from your terminal window, and then create a TMDb API key that will allow you to access the database.
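As a rough sketch of the setup, assuming tmdbsimple has been installed with `pip install tmdbsimple` and that the key is kept in an environment variable (the name TMDB_API_KEY and the helper load_api_key are illustrative choices, not something TMDb or tmdbsimple requires):

```python
import os


def load_api_key(env_var="TMDB_API_KEY"):
    """Read the TMDb API key from an environment variable.

    The variable name is an assumption made for this sketch; use
    whatever name you exported in your own shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Typical use once tmdbsimple is installed:
#   import tmdbsimple as tmdb
#   tmdb.API_KEY = load_api_key()
```

Keeping the key out of the script itself makes it harder to publish it by accident when sharing a notebook.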
Getting Started with the API
My first step was to look at the TMDb documentation, which explains the variety of ways you can explore the data. I then tried the ‘search’ method to see if I could find movies that were produced by a specific company, in this case, Sony Pictures.
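A minimal sketch of what such a search might look like. It assumes the query goes through the company() call on a tmdbsimple Search() object (the exact call used for this post is not shown, and movie() would be the alternative); the helper name first_matches is hypothetical.

```python
def first_matches(search, query, limit=2):
    """Return the first `limit` company matches for `query`.

    `search` is expected to be a tmdbsimple `Search()` instance, whose
    `company()` method returns the parsed JSON response as a dict.
    """
    response = search.company(query=query)
    return response.get("results", [])[:limit]


# Usage, assuming tmdbsimple is installed and tmdb.API_KEY is set:
#   import tmdbsimple as tmdb
#   for match in first_matches(tmdb.Search(), "Sony Pictures"):
#       print(match["id"], match["name"])
```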
Amazing! By using the .Search() method, we were able to include a query for “Sony Pictures” and print the first two results. This is a great start in learning how to use the functions within the tmdbsimple wrapper to search for specific categories of movies!
The next step in my process was to browse the .Discover() method to filter results. The example below uses the discover method to look at all films that were released in 2016.
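A sketch of such a discover call, assuming the year filter is passed as the TMDb discover parameter primary_release_year; the helper name titles_for_year is hypothetical, and response2 simply mirrors the variable name used in this post.

```python
def titles_for_year(discover, year):
    """Return the titles on the first page of films released in `year`.

    `discover` is expected to be a tmdbsimple `Discover()` instance;
    keyword arguments to `movie()` are forwarded as query parameters.
    """
    response2 = discover.movie(primary_release_year=year)
    return [film["title"] for film in response2["results"]]


# Usage, assuming tmdbsimple is installed and tmdb.API_KEY is set:
#   import tmdbsimple as tmdb
#   titles = titles_for_year(tmdb.Discover(), 2016)
#   for title in titles:
#       print(title)
#   print(len(titles))
```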
The for loop prints the titles of all the movies in response2. However, we then see that response2 contains only 20 results. That doesn’t make sense: there must be more than 20 films in the database that were released in 2016. The reason is that the API returns results one page at a time, 20 per page, so response2 holds only the first page. More exploration will therefore be required in order to retrieve a full list of results.
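One way to confirm this is to look at the paging fields that TMDb includes alongside the results list (page, total_pages, total_results). A small sketch follows; the helper name paging_summary is hypothetical and the sample response is made up for illustration.

```python
def paging_summary(response):
    """Summarise the paging fields of a TMDb list response (a dict)."""
    return {
        "page": response["page"],
        "results_on_page": len(response["results"]),
        "total_pages": response["total_pages"],
        "total_results": response["total_results"],
    }


# Made-up response for illustration only; real totals will differ.
sample = {
    "page": 1,
    "results": [{"title": f"Film {i}"} for i in range(20)],
    "total_pages": 7,
    "total_results": 140,
}
print(paging_summary(sample))
```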
Using the requests.get() method to access larger datasets
Clearly, we will need to implement pagination in order to access more than 20 results at a time. Before exploring this, I first decided to try accessing the API directly with the requests.get() method, passing different URLs to make different API calls. Below is one example of this, where I use the requests.get() method to look at the most popular movies:
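A minimal sketch of such a call, assuming the version 3 discover endpoint sorted by popularity (the exact URL used for this post is not shown). The names DISCOVER_URL, discover_params and fetch_popular are hypothetical.

```python
import requests

# TMDb v3 discover endpoint; the API key is sent as a query parameter.
DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"


def discover_params(api_key, page=1):
    """Build the query parameters for one page of movies, most popular first."""
    return {"api_key": api_key, "sort_by": "popularity.desc", "page": page}


def fetch_popular(api_key, page=1):
    """Request one page of results and return the parsed JSON as a dict."""
    response = requests.get(
        DISCOVER_URL, params=discover_params(api_key, page), timeout=10
    )
    response.raise_for_status()  # raise on HTTP errors such as 401 or 404
    return response.json()


# Usage (needs a real key and network access):
#   data = fetch_popular("YOUR_API_KEY")
#   for film in data["results"]:
#       print(film["title"], film["popularity"])
```

Passing the parameters as a dict rather than pasting them into the URL makes it easy to change sort_by or page later.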
The requests.get() method returns results similar to those we got from the .Search() and .Discover() methods. There are many ways to alter the URL passed to requests.get() to filter and sort the results differently; this is just one of the many ways to use this powerful function!
Next steps: Collecting Data and Pagination
Now for the exciting part: time to extract a larger dataset! Using a URL similar to the one in the example above that selects movies based on popularity, I will paginate through 500 pages (hence the n<500) at 20 results per page to create a list called finallist that will hopefully contain 10,000 results.
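The loop can be sketched roughly as follows. This is one possible shape under stated assumptions, not the exact code behind this post: collect_pages and fetch_page are hypothetical names, and fetch_page stands for any function that takes a page number and returns the parsed JSON for that page.

```python
def collect_pages(fetch_page, n_pages=500):
    """Call fetch_page(n) for pages 1..n_pages and pool all the results."""
    finallist = []
    n = 0
    while n < n_pages:
        n += 1  # TMDb pages are numbered starting from 1
        data = fetch_page(n)
        finallist.extend(data["results"])
        # Stop early if the API reports fewer pages than requested.
        if n >= data.get("total_pages", n_pages):
            break
    return finallist


# Usage with requests (API_KEY and URL are placeholders you supply):
#   import requests
#   def fetch_page(page):
#       r = requests.get(URL, params={"api_key": API_KEY, "page": page}, timeout=10)
#       r.raise_for_status()
#       return r.json()
#   finallist = collect_pages(fetch_page, 500)
#   print(len(finallist))
```

Keeping the HTTP call in a separate function means the loop can be tried out on canned data before spending 500 real requests.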
And voila! The length of finallist is 10,000, meaning that I was successful in pulling a larger dataset. So previously, I was only analyzing one page of data at a time, but now I have 500 pages!
Final steps: Creating a DataFrame
While having a list is useful, browsing my data in a table format will be helpful in understanding what potential story this data can tell. The TMDb API has an enormous collection of data, and to keep track of what data was collected, it is wise to display the data in a cleaner format.
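With pandas, a list of dictionaries converts to a table in a single call. In this sketch the two records are made up to stand in for finallist, and only a handful of the fields a movie record carries are shown.

```python
import pandas as pd

# Two made-up records standing in for the dictionaries in finallist.
finallist = [
    {"id": 1, "title": "Example Movie A", "original_title": "Example Movie A",
     "overview": "Placeholder text.", "popularity": 12.5,
     "poster_path": "/a.jpg", "backdrop_path": "/a_bg.jpg"},
    {"id": 2, "title": "Example Movie B", "original_title": "Exemple B",
     "overview": "Placeholder text.", "popularity": 8.1,
     "poster_path": "/b.jpg", "backdrop_path": "/b_bg.jpg"},
]

# Each dictionary becomes a row; each key becomes a column.
df = pd.DataFrame(finallist)
print(df.head())
```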
The DataFrame above contains all columns that were included in the API call, but it is clear that some of these columns will not be necessary in our analysis (e.g. poster_path, backdrop_path, etc.). Let’s quickly pull them out in order to get a cleaner view:
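A sketch of that step using DataFrame.drop. Only the two columns named above are listed; the helper name drop_image_columns and the tiny sample table are made up for illustration.

```python
import pandas as pd


def drop_image_columns(df):
    """Return a copy of df without the image-path columns."""
    unwanted = ["poster_path", "backdrop_path"]
    # errors="ignore" skips any listed column that is not present.
    return df.drop(columns=unwanted, errors="ignore")


# Made-up single-row table for illustration.
sample = pd.DataFrame(
    [{"id": 1, "title": "Example Movie A",
      "poster_path": "/a.jpg", "backdrop_path": "/a_bg.jpg"}]
)
print(drop_image_columns(sample).columns.tolist())
```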
Almost there. The “overview” column contains text that won’t be used for our data visualization purposes, and we see both “original_title” and “title” (we only really need one of them in our DataFrame). Let’s drop these two extra columns and see what is left:
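Before dropping “original_title”, it can be reassuring to count how often it actually differs from “title”. The sketch below adds that check as a suggestion (it is not a step described above); the helper name drop_text_columns and the sample rows are made up.

```python
import pandas as pd


def drop_text_columns(df):
    """Drop 'overview' and 'original_title'; report how many titles differed."""
    n_different = int((df["title"] != df["original_title"]).sum())
    slim = df.drop(columns=["overview", "original_title"])
    return slim, n_different


# Made-up rows for illustration.
sample = pd.DataFrame([
    {"id": 1, "title": "Example Movie A", "original_title": "Example Movie A",
     "overview": "Placeholder text."},
    {"id": 2, "title": "Example Movie B", "original_title": "Exemple B",
     "overview": "Placeholder text."},
])
slim, n_different = drop_text_columns(sample)
print(slim.columns.tolist(), n_different)
```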
Great! This DataFrame looks pretty good. A lot more cleaning could be done to simplify this table, which I unfortunately do not have time to cover in this post (stay tuned for my future publications!), so note that this is just the beginning of a standard data cleaning process. Before you begin cleaning and storing your variables and data, play around with the API calls to see other ways to sort and create your dataset!
This article provides just a brief insight into the world of APIs. The TMDb API is one of many free, user-friendly APIs that are available. It is important to remember that each API is different, and it is crucial that you read through the documentation for each API before you begin to play around with the data. I hope this post provided you with a solid introduction to using the TMDb API, and I encourage you to explore this topic further on your own!