Preparing the Profile Data
Dating is rough for the single person. Dating apps can be even rougher. The algorithms dating apps rely on are largely kept private by the companies that run them. Today, we will try to shed some light on these algorithms by building a dating algorithm using AI and machine learning. More specifically, we will be utilizing unsupervised machine learning in the form of clustering.
Hopefully, we can improve the process of dating profile matching by pairing users together with machine learning. If dating companies such as Tinder or Hinge already take advantage of these techniques, then we will at least learn a little bit more about their profile matching process and some unsupervised machine learning concepts. However, if they do not use machine learning, then perhaps we can improve the matchmaking process ourselves.
The idea behind the use of machine learning for dating apps and algorithms has been explored and detailed in a previous article.
That article dealt with the application of AI to dating apps. It laid out the outline of the project, which we will be finalizing here in this article. The overall concept is simple: we will use K-Means Clustering or Hierarchical Agglomerative Clustering to cluster the dating profiles with one another. By doing so, we hope to provide these hypothetical users with more matches like themselves instead of profiles unlike their own.
Now that we have an outline to begin creating this machine learning dating algorithm, we can begin coding it all out in Python!
To begin, we must first import all the libraries necessary for this clustering algorithm to run properly. We will also load in the Pandas DataFrame we created when we forged the fake dating profiles.
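The original import cell is not reproduced here, but a minimal sketch of the setup might look like the following; the file name profiles.pkl is an assumption for wherever the fake profiles were saved.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Loading in the DataFrame of fake dating profiles (file name is hypothetical)
df = pd.read_pickle("profiles.pkl")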
With our dataset good to go, we can begin the next step for our clustering algorithm.
Scaling the Data
The next step, which will assist our clustering algorithm’s performance, is scaling the dating categories (movies, TV, religion, etc.). This will potentially decrease the time it takes to fit and transform our clustering algorithm to the dataset.
# Instantiating the Scaler
scaler = MinMaxScaler()

# Scaling the categories, then replacing the old values
df = df[['Bios']].join(
    pd.DataFrame(
        scaler.fit_transform(
            df.drop('Bios', axis=1)),
        columns=df.columns[1:],
        index=df.index))
Vectorizing the Bios
Next, we will have to vectorize the bios we have from the fake profiles. We will be creating a new DataFrame containing the vectorized bios and dropping the original ‘Bios’ column. With vectorization, we will be implementing two different approaches to see if they have a significant effect on the clustering algorithm. Those two vectorization approaches are Count Vectorization and TFIDF Vectorization. We will be experimenting with both approaches to find the optimum vectorization method.
Here we have the option of either using CountVectorizer or TfidfVectorizer for vectorizing the dating profile bios. When the bios have been vectorized and placed into their own DataFrame, we will concatenate them with the scaled dating categories to create a new DataFrame with all the features we need.
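A rough sketch of this step, assuming the scaled DataFrame from above; CountVectorizer is used here, but TfidfVectorizer can be swapped in to compare, and the variable names df_wrds and new_df are purely illustrative.

# Instantiating the vectorizer (swap in TfidfVectorizer() to compare approaches)
vectorizer = CountVectorizer()

# Fitting the vectorizer to the bios and building a DataFrame of word counts
x = vectorizer.fit_transform(df['Bios'])
df_wrds = pd.DataFrame(x.toarray(),
                       columns=vectorizer.get_feature_names_out(),
                       index=df.index)

# Concatenating the vectorized bios with the scaled categories
# and dropping the original 'Bios' column
new_df = pd.concat([df.drop('Bios', axis=1), df_wrds], axis=1)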
Based on this final DF, we have more than 100 features. Because of this, we will have to reduce the dimensionality of our dataset by using Principal Component Analysis (PCA).
PCA on the DataFrame
In order for us to reduce this large feature set, we will have to implement Principal Component Analysis (PCA). This technique will reduce the dimensionality of our dataset but still retain much of the variability or valuable statistical information.
What we are doing here is fitting and transforming our last DF, then plotting the cumulative explained variance against the number of features. This plot will visually tell us how many features account for the variance.
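A sketch of how that scan might look, assuming the combined DataFrame from the previous step is called new_df:

# Fitting PCA to the full feature set
pca = PCA()
pca.fit(new_df)

# Plotting the cumulative explained variance against the number of components
cumulative_variance = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()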
After running our code, the number of features that account for 95% of the variance is 74. With that number in mind, we can apply it to our PCA function to reduce the number of principal components or features in our last DF from 117 to 74. These features will now be used instead of the original DF to fit our clustering algorithm.
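Applying that number back to PCA is then a matter of setting n_components; a brief sketch:

# Reducing the DataFrame to the 74 components that account for ~95% of the variance
pca = PCA(n_components=74)
df_pca = pca.fit_transform(new_df)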
With our data scaled, vectorized, and PCA’d, we can begin clustering the dating profiles. In order to cluster our profiles together, we must first find the optimum number of clusters to create.
Evaluation Metrics for Clustering
The optimum number of clusters will be determined based on specific evaluation metrics which will quantify the performance of the clustering algorithms. Since there is no definite set number of clusters to create, we will be using a couple of different evaluation metrics to determine the optimum number of clusters. These metrics are the Silhouette Coefficient and the Davies-Bouldin Score.
These metrics each have their own advantages and disadvantages. The choice to use either one is purely subjective and you are free to use another metric if you choose.
Finding the Right Number of Clusters
Below, we will run some code that fits our clustering algorithm with differing numbers of clusters.
By running this code, we will be going through several steps:
- Iterating through different quantities of clusters for our clustering algorithm.
- Fitting the algorithm to our PCA’d DataFrame.
- Assigning the profiles to their clusters.
- Appending the respective evaluation scores to a list. This list will be used later to determine the optimum number of clusters.
Also, there is an option to run both types of clustering algorithms in the loop: Hierarchical Agglomerative Clustering and KMeans Clustering. Simply uncomment the desired clustering algorithm.
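A sketch of such a loop, assuming the PCA’d data from above; the range of cluster counts tried here (2 through 20) is an assumption.

# Lists to hold the evaluation scores for each cluster count
s_scores = []
db_scores = []

# The range of cluster counts to try (this range is an assumption)
cluster_range = range(2, 21)

for n in cluster_range:
    # Hierarchical Agglomerative Clustering (uncomment KMeans to compare)
    model = AgglomerativeClustering(n_clusters=n)
    # model = KMeans(n_clusters=n, random_state=42)

    # Fitting the model to the PCA'd DataFrame and assigning profiles to clusters
    labels = model.fit_predict(df_pca)

    # Appending the evaluation scores for this number of clusters
    s_scores.append(silhouette_score(df_pca, labels))
    db_scores.append(davies_bouldin_score(df_pca, labels))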
Evaluating the Clusters
To evaluate the clustering algorithms, we will create an evaluation function to run on our list of scores.
With this function we can evaluate the list of scores acquired and plot out the values to determine the optimum number of clusters.
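A simple version of such a helper might look like the following sketch; the function name cluster_eval is hypothetical.

def cluster_eval(scores, metric_name):
    # Plotting an evaluation metric against the number of clusters tried
    plt.figure()
    plt.plot(list(cluster_range), scores)
    plt.xlabel('Number of Clusters')
    plt.ylabel(metric_name)
    plt.show()

# Higher is better for the Silhouette Coefficient,
# lower is better for the Davies-Bouldin Score
cluster_eval(s_scores, 'Silhouette Coefficient')
cluster_eval(db_scores, 'Davies-Bouldin Score')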
Based on both of these charts and evaluation metrics, the optimum number of clusters seems to be 12. For our final run of the algorithm, we will be using:
- CountVectorizer to vectorize the bios instead of TfidfVectorizer.
- Hierarchical Agglomerative Clustering instead of KMeans Clustering.
- 12 Clusters
With these parameters in place, we will cluster our dating profiles and assign each profile a number to determine which cluster it belongs to.
With everything ready, we can finally discover the clustering assignments for each dating profile.
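A sketch of that final run, again assuming the PCA’d data from above; the column name 'Cluster #' is an assumption.

# Final run: Hierarchical Agglomerative Clustering with 12 clusters
hac = AgglomerativeClustering(n_clusters=12)
cluster_assignments = hac.fit_predict(df_pca)

# Adding the cluster assignments as a new column on the profiles DataFrame
df['Cluster #'] = cluster_assignments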
Once we have run the code, we can create a new column containing the cluster assignments. This new DataFrame now shows the assignments for each dating profile.
We have successfully clustered our dating profiles! We can now filter our selection in the DataFrame by selecting only specific cluster numbers. Perhaps more could be done, but for simplicity’s sake, this clustering algorithm functions well.
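For instance, with the hypothetical 'Cluster #' column from above, viewing a single cluster is a one-liner:

# Viewing only the profiles assigned to a given cluster (e.g. cluster 2)
df[df['Cluster #'] == 2]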
By utilizing an unsupervised machine learning technique such as Hierarchical Agglomerative Clustering, we were successfully able to cluster together over 5,000 different dating profiles. Feel free to change and experiment with the code to see if you could potentially improve the overall result. Hopefully, by the end of this article, you were able to learn more about NLP and unsupervised machine learning.
There are other potential improvements to be made to this project, such as implementing a way to include new user input data to see who they might potentially match or cluster with. We could also create a dashboard to fully realize this clustering algorithm as a prototype dating app. There are always new and exciting ways to continue this project from here, and maybe, in the end, we can help solve people’s dating woes with this project.
Check out the following article to see how we created a web application for this dating app:
Link to the Web Application