Netflix, Spotify, Facebook and many other platforms have powerful algorithms and tools that target specific advertising to their users. What makes these advertisements so powerful is that they are tailored to users' existing preferences. These giants have comprehensive data that they have used to build sophisticated recommendation systems, resulting in effective, targeted recommendations. But how would a new, up-and-coming startup go about creating a recommendation system? How does a company build one from scratch?
The goal of this project was to address this exact problem. Effective recommendation systems use mounds of data on users' existing preferences, typically their ratings of other, similar items, to produce their results. But what if a company has no data to start with? How can it collect or generate this data? This is known as the cold-start problem, a common issue in the field of recommender systems. Small companies need to address it before they can even begin to build their recommendation systems. That is exactly what I aimed to do in this project: create a mechanism that generates a proxy for users' preferences when their past preferences are not known, then use these generated ratings to predict their tastes and provide recommendations. Essentially, I mimicked what a new company would need to do if it were creating a recommendation system with no existing data!
My first step in the process was to address the cold-start problem. I ran several machine learning models with the aim of optimizing precision, in order to keep false positives to a minimum. After fitting vanilla (default-setting) models for KNN, Bayes, tree methods, bagging methods, boosting methods and SVM, I found that XGBoost gave the strongest test precision. I then ran a grid search to determine the optimal hyperparameters, and the result was a test precision score of approximately 0.407, as shown below:
So the next question is, how can I use this model to create ratings for new users? Fortunately, the XGBoost model has a method called predict_proba, which returns the predicted probability of each possible rating. Using these probabilities, rather than a single predicted class, allows the generated ratings to take on non-integer values. This will likely reduce the error between the ratings fed into the recommender and the ratings it later predicts, as the numbers are not restricted to strict integer values. These probabilistic ratings can then be fed into the recommendation system as a proxy for a user's ratings, and the recommendations provided will be based on them.
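To make the idea of a probabilistic rating concrete, here is a minimal sketch of one plausible way to turn class probabilities into a single non-integer rating: take the probability-weighted average of the rating values. This is an assumption about the approach, not necessarily the exact calculation used in this project. The function name, the 1 to 5 rating values and the example probabilities are all hypothetical; in practice each row would come from the classifier's predict_proba output, with one column per rating class.

```python
def expected_ratings(probabilities, rating_values):
    """Collapse per-class probabilities into one non-integer rating per row.

    probabilities: rows shaped like a classifier's predict_proba output,
        one probability per rating class.
    rating_values: the rating each column stands for (assumed column order).
    """
    ratings = []
    for row in probabilities:
        if len(row) != len(rating_values):
            raise ValueError("one probability is needed per rating value")
        total = sum(row)
        # Normalise defensively in case a row does not sum exactly to 1.
        weighted = sum(p * r for p, r in zip(row, rating_values))
        ratings.append(weighted / total)
    return ratings


# Hypothetical probabilities for two user/song pairs over ratings 1..5.
example_proba = [
    [0.10, 0.10, 0.20, 0.30, 0.30],
    [0.60, 0.20, 0.10, 0.05, 0.05],
]
example_ratings = expected_ratings(example_proba, [1, 2, 3, 4, 5])
print(example_ratings)  # approximately [3.6, 1.75]
```

A pair the classifier is torn about lands between the integer ratings, which is exactly the extra information a hard class prediction would throw away.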
So, once I collected these probabilistic ratings, I was ready to build the recommendation system. I fit KNN Basic, KNN With Means, KNN Baseline and SVD models. The SVD model had the lowest RMSE value, and using the probabilistic ratings brought this value down to 0.243 on a 0–5 scale; a very strong result! The model was then able to produce estimated ratings for every combination of user and song; pretty cool, right? These estimated ratings can then be turned into recommendations for users; here is an example below:
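For readers who want to try this step themselves, here is a rough, dependency-free sketch of SVD-style matrix factorization trained with stochastic gradient descent. It is a hand-rolled stand-in for illustration, not the code or library used in this project, and every name, hyperparameter and rating in it is hypothetical. It shows the three pieces described above: fitting on (user, song, rating) triples with non-integer ratings, measuring RMSE, and ranking the songs a user has not rated yet.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit a small biased matrix factorisation with SGD.

    ratings: list of (user, song, rating) tuples; ratings may be non-integer.
    Returns a predict(user, song) function.
    """
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Sorting keeps the random initialisation reproducible across runs.
    users = sorted({u for u, _, _ in ratings})
    songs = sorted({s for _, s, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {s: 0.0 for s in songs}  # song biases
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {s: [rng.gauss(0, 0.1) for _ in range(n_factors)] for s in songs}

    for _ in range(n_epochs):
        for u, s, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[s]))
            err = r - (mean + bu[u] + bi[s] + dot)
            bu[u] += lr * (err - reg * bu[u])
            bi[s] += lr * (err - reg * bi[s])
            for f in range(n_factors):
                p, q = pu[u][f], qi[s][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[s][f] += lr * (err * p - reg * q)

    def predict(user, song):
        # Unknown users or songs fall back to the global mean plus known biases.
        est = mean + bu.get(user, 0.0) + bi.get(song, 0.0)
        if user in pu and song in qi:
            est += sum(p * q for p, q in zip(pu[user], qi[song]))
        return est

    return predict


def rmse(predict, ratings):
    """Root mean squared error of predict over (user, song, rating) tuples."""
    squared = sum((r - predict(u, s)) ** 2 for u, s, r in ratings)
    return math.sqrt(squared / len(ratings))


def top_n(predict, user, songs, already_rated, n=2):
    """Rank the songs the user has not rated by estimated rating."""
    candidates = [s for s in songs if s not in already_rated]
    return sorted(candidates, key=lambda s: predict(user, s), reverse=True)[:n]


# Hypothetical probabilistic ratings for three users and four songs.
data = [
    ("ana", "song_a", 4.6), ("ana", "song_b", 1.3), ("ana", "song_c", 4.1),
    ("ben", "song_a", 4.2), ("ben", "song_b", 1.8), ("ben", "song_d", 3.9),
    ("cat", "song_b", 4.4), ("cat", "song_c", 1.5), ("cat", "song_d", 2.0),
]
all_songs = ["song_a", "song_b", "song_c", "song_d"]

predict = train_mf(data)
train_rmse = rmse(predict, data)  # error on the training data only
recs = top_n(predict, "ana", all_songs, {"song_a", "song_b", "song_c"}, n=1)
print(round(train_rmse, 3), recs)
```

Note that the RMSE here is computed on the same ratings the model was trained on, so it only confirms that the model fits; a held-out test set is needed for a figure comparable to the one reported above.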
By combining a classification model with the SVD model, this project demonstrates how a new business can not only provide recommendations for existing users, but also work around the cold-start problem and generate recommendations for new users. As the business evolves and more data is collected, the recommendation system can be updated to a more sophisticated, robust model. In the meantime, this combined classification/SVD method is a great way to get started when prior data is not available. Hope you enjoy your recommendations!