The data set consists of 2017 Spotify songs downloaded from Spotify's API by a Kaggle user\footnote{McIntire, George (04/08-2017). Spotify Song Attributes: \url{https://www.kaggle.com/geomack/spotifyclassification}}. We use ten of the attributes:
\\
\textit{Acousticness,
Danceability,
Duration in ms.,
...
In the first half of the report, \textit{tempo} is used as a target variable such that only nine variables are considered.
\section{Clustering: Are songs grouped by their tempo?}
Before clustering, tempo is taken out, as this variable was shown to explain a relatively high amount of variance in the first report and is therefore suspected to account for some of the clustering in the data. The songs are then thresholded into four groups: those with a tempo under 90 bpm, those between 90 and 100 bpm, those between 100 and 110 bpm, and those with a tempo over 110 bpm.
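A minimal sketch of this thresholding step is shown below; the DataFrame name \texttt{songs} and the use of \texttt{pandas} are assumptions for illustration, not the exact code used for the report.
\begin{verbatim}
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Spotify attribute DataFrame.
songs = pd.DataFrame({"tempo": [75.0, 95.2, 104.8, 128.3]})

# Threshold tempo into the four groups used in the report:
# under 90, 90-100, 100-110 and over 110 bpm.
bins = [-np.inf, 90, 100, 110, np.inf]
labels = ["<90", "90-100", "100-110", ">110"]
songs["tempo_group"] = pd.cut(songs["tempo"], bins=bins, labels=labels)
print(songs)
\end{verbatim}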
\subsection{Hierarchical clustering of the songs}
For this data set, we have decided to use complete linkage and Euclidean distance.
The data is too high-dimensional for theoretical considerations of cluster shapes before working with the data, so the linkage and distance measure were chosen based on initial tests, which showed that other measures resulted in multiple singleton groups. This might be due to outliers in the data set, which is checked later in the report. The resulting clusters are plotted in a dendrogram below.
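As a rough sketch of how such a clustering and dendrogram can be produced (assuming \texttt{scipy} and a standardized attribute matrix \texttt{X}; the plotting details of the actual report may differ):
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical stand-in for the standardized nine-attribute song matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))

# Complete linkage with Euclidean distance, as chosen above.
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z, truncate_mode="level", p=5)
plt.xlabel("Songs (merged clusters)")
plt.ylabel("Complete-linkage distance")
plt.show()
\end{verbatim}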
...
\subsection{Evaluation of clusterings: Are tempo groups found?}
%\textit{Evaluate the quality of the clusterings using GMM label information and for hierarchical clustering with the same number of clusters as in the GMM.}
To evaluate whether the clusterings are similar to the predefined grouping by the tempo attribute, three different similarity measures are used: the Rand index, the Jaccard index and normalized mutual information (NMI). The Rand index will typically be very high when there are many clusters; intuitively, this is because most pairs of observations then lie in different clusters rather than in the same cluster, which pushes the Rand index close to one. Therefore the Jaccard index is also used, as it disregards pairs of observations that are placed in different clusters in both partitions. The third measure, normalized mutual information, serves a similar purpose as the Jaccard and Rand indices but has a more theoretical background in information theory: it quantifies the amount of information one clustering provides about the other. A short sketch of how these measures can be computed is given below, followed by the evaluation of the GMM and hierarchical clustering in the table.
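As a rough sketch (using \texttt{scikit-learn} plus a small pair-counting helper for the Jaccard index; the label arrays here are purely hypothetical), the three measures could be computed as follows:
\begin{verbatim}
import numpy as np
from sklearn.metrics import rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two clusterings."""
    C = contingency_matrix(labels_a, labels_b)
    together_both = (C * (C - 1) // 2).sum()   # pairs joined in both clusterings
    together_a = (C.sum(axis=1) * (C.sum(axis=1) - 1) // 2).sum()
    together_b = (C.sum(axis=0) * (C.sum(axis=0) - 1) // 2).sum()
    return together_both / (together_a + together_b - together_both)

# Hypothetical labels: z = tempo groups, gmm = GMM cluster assignments.
z = np.array([0, 0, 1, 1, 2, 2, 3, 3])
gmm = np.array([1, 1, 0, 0, 2, 3, 3, 3])

print("Rand   :", rand_score(z, gmm))
print("Jaccard:", pair_jaccard(z, gmm))
print("NMI    :", normalized_mutual_info_score(z, gmm))
\end{verbatim}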
\begin{table}[H]
\centering
\begin{tabular}{l l r l r }
...
Kernel density estimation is a way to approximate the probability density function of a random variable in a non-parametric way. In the case of the Spotify data set, the fitted GMM uses multivariate normal components because of the number of features in the data set. The fitted GMM is then evaluated on the songs in order to calculate their individual density scores. An outlier under this model would have a low density score, meaning that the probability of the song fitting into any of the clusters found by the GMM is low. The songs with the lowest density scores are illustrated in bar chart \ref{GMchart}.
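A minimal sketch of this kind of density ranking, using \texttt{scikit-learn}'s \texttt{GaussianMixture} on hypothetical data (the number of components, the song titles and the selection of the six lowest scores are placeholders), could look as follows:
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the standardized song matrix and titles.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
titles = np.array(["song %d" % i for i in range(len(X))])

# Fit a GMM; the number of components would come from the earlier analysis.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(X)

# score_samples returns log-densities; low values indicate potential outliers.
log_density = gmm.score_samples(X)
lowest = np.argsort(log_density)[:6]
for i in lowest:
    print(titles[i], log_density[i])
\end{verbatim}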
\begin{figure}[H]
\centering
\includegraphics[width=.8\linewidth]{out_KDE}
\caption{Songs with the lowest leave-one-out Gaussian kernel density scores.}
\label{GMchart}
\end{figure}\noindent
The K-nearest neighbor estimation detects which objects deviate from normal behavior. Firstly, a KNN model with $K$ neighbors is fitted to the data. Then, the inverse distance density estimate is calculated through the following expression,
\[
\mathrm{density}_{\mathbf{X} \backslash i}(\mathbf{x}_i, K) = \frac{1}{\frac{1}{K} \sum_{\mathbf{x}' \in N_{\mathbf{X} \backslash i}(\mathbf{x}_i, K)} d(\mathbf{x}_i, \mathbf{x}')}
\]
where $\mathbf{x}' \in N_{\mathbf{X} \backslash i}(\mathbf{x}_i, K)$ are the $K$ nearest observations to $\mathbf{x}_i$, excluding $\mathbf{x}_i$ itself.
% TODO find a real K in an appropriate way.
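A minimal sketch of this density computation (assuming \texttt{scikit-learn}'s \texttt{NearestNeighbors} and a standardized matrix \texttt{X}; the value of $K$ is a placeholder) could be:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the standardized song matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
K = 5  # placeholder; a suitable K still has to be chosen

# K+1 neighbors are requested because each point is its own nearest neighbor.
nbrs = NearestNeighbors(n_neighbors=K + 1).fit(X)
dist, _ = nbrs.kneighbors(X)

# Inverse of the mean distance to the K nearest neighbors (column 0 is the
# point itself and is therefore skipped).
density = 1.0 / dist[:, 1:].mean(axis=1)

# The songs with the lowest density are the most likely outliers.
lowest = np.argsort(density)[:6]
print(lowest, density[lowest])
\end{verbatim}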
The lower the inverse distance density score of a song, the more likely it is to be an outlier. Therefore, the songs with the lowest inverse distance density estimates are illustrated in bar chart \ref{invest}.
\begin{figure}[H]
\label{invest}
\centering
...
\end{figure}
\subsection{Three Scoring Methods for Outlier Detection}
\textit{Are there any outliers in the data according to the three methods?}\\
From the illustrations of the ranked observations in the previous subsection, it is evident that the leave-one-out Gaussian kernel density finds a number of different outliers, including "My Heart Will Go On", "The Nearness of You", "Southern Man", "Be My Valentine", "Remember" and "The Shadow of Your Smile". It makes sense that this method finds many outliers, as this density estimate is a rather basic multivariate normal model and some of the data is bound to lie far from the density mean: about 4.54\% of the data lies outside of two standard deviations.
Regarding the illustration of the K-Nearest Neighbor (KNN) density, "The Nearness of You", "Southern Man", "Willing and Able" and "Music is the Answer" are the most probable outliers. Meanwhile, the Average Relative Density (ARD) identifies just the one outlier, "Mask Off", as the density of this observation is drastically lower than that of the other observations.
To find the most probable outliers in the dataset, the best course of action is to compare the results of the three methods covered above. This comparison shows that "The Nearness of You" and "Southern Man" are (some of) the most probable outliers, as the Gaussian kernel density and the KNN density have these probable outliers in common. "Mask Off" could also be an outlier, as this observation has the lowest density according to the ARD, but it is not flagged as such by any of the other methods.
\section{Association Mining}
In this part of the report, the data has to be binarized in order to use the Apriori algorithm to find associations between observations. Since the attributes are numerical rather than binary, each variable has been split into three intervals and one-out-of-K encoded. The intervals for every variable are given by the 0.33 and 0.66 quantiles as well as the maximum value.
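A minimal sketch of this binarization (using \texttt{pandas} on a hypothetical two-attribute example; the column names and details are illustrative only):
\begin{verbatim}
import numpy as np
import pandas as pd

# Hypothetical stand-in for the attribute DataFrame (two attributes shown).
rng = np.random.default_rng(0)
df = pd.DataFrame({"energy": rng.random(100),
                   "loudness": rng.normal(size=100)})

encoded = []
for col in df.columns:
    # Interval edges: minimum, the 0.33 and 0.66 quantiles, and the maximum.
    edges = [df[col].min(), df[col].quantile(0.33),
             df[col].quantile(0.66), df[col].max()]
    labels = [col + "_low", col + "_mid", col + "_high"]
    binned = pd.cut(df[col], bins=edges, labels=labels, include_lowest=True)
    encoded.append(pd.get_dummies(binned))

# One-out-of-K encoded (binarized) matrix: three columns per attribute.
X_bin = pd.concat(encoded, axis=1)
print(X_bin.shape)
\end{verbatim}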
\subsection{Apriori Algorithm for Frequent Itemsets and Association Rules}
\textit{Find the frequent itemsets and the association rules with high confidence based on the results of the Apriori algorithm.}\\
In this section, the results of running the Apriori algorithm on the binarized data are explored. To run the algorithm, a minimum support of 0.11 and a minimum confidence of 0.6 have been chosen after some exploration of suitable parameters. A short sketch of how the algorithm can be invoked is given below, and the resulting rules are shown in the table that follows.
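The sketch assumes the binarized matrix \texttt{X\_bin} from the previous section and uses the \texttt{mlxtend} implementation of Apriori; the actual implementation used for the report may differ.
\begin{verbatim}
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical tiny binarized matrix; every column is a 0/1 indicator
# for one attribute interval (in practice, use X_bin from above).
X_bin = pd.DataFrame({
    "energy_low":  [1, 0, 1, 0, 1],
    "energy_high": [0, 1, 0, 1, 0],
    "loud_low":    [1, 0, 1, 0, 1],
    "loud_high":   [0, 1, 0, 1, 0],
}, dtype=bool)

# Frequent itemsets with the minimum support used in the report (0.11).
itemsets = apriori(X_bin, min_support=0.11, use_colnames=True)

# Association rules filtered on the minimum confidence used here (0.6).
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
\end{verbatim}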
\begin{table}[H]
...
\noindent All of the association rules seem to have a very low support, which could have to do with the fact that the data contains both a large number of attributes (27 after binarization) and a very large number of possible combinations of these attributes ($2^{27} = 134{,}217{,}728$ to be exact). The confidence, on the other hand, seems to be relatively high, which indicates that at least some of the association rules have some merit behind them.
\subsection{Interpretation of the Association Rules}
\textit{Interpret the generated association rules.}\\
Analysing the association rules, it is clear that most of them concern an energy-loudness-valence relationship. This is most obvious in the first section of the table, which communicates that if and only if the energy is low, then the loudness is low; meanwhile, if and only if the energy is high, then the loudness is also high. This trend also seems to be mirrored in the other two sections, further reinforcing the findings of the algorithm. Generally, Apriori has found that quiet, passive songs tend to have a high amount of acousticness, while energetic songs have low acousticness and high loudness. This seems logical when compared to songs like "Girlfriend" by Avril Lavigne, which is a very energetic and loud song and thus, as the rules predict, not a very acoustic song.