The data set consists of 2017 Spotify songs downloaded from Spotify's API by a K
Tempo,
Valence,
\\}
In the first half of the report, \textit{tempo} is used as a target variable such that only nine variables are considered.
\section{Clustering: Are songs grouped by their tempo?}
Before working with the clustering, tempo is taken out, as the first report showed it to be a variable explaining a relatively high amount of variance, and it is therefore suspected to account for some of the clustering in the data. The songs are then thresholded into four groups: those with a tempo under 90 bpm, those between 90 and 100 bpm, those between 100 and 110 bpm, and those with a tempo over 110 bpm. The nine explanatory variables are standardized by subtracting the mean and dividing by the standard deviation.
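\noindent A minimal sketch of this preprocessing is given below, assuming the raw data is available as NumPy arrays; the placeholder data and variable names are purely illustrative.
\begin{verbatim}
# Sketch: building the four tempo groups and standardizing the nine
# remaining variables (tempo and X9 are placeholders for the raw data).
import numpy as np

rng = np.random.default_rng(0)
tempo = rng.uniform(60, 180, size=2017)   # placeholder tempo values in bpm
X9 = rng.standard_normal((2017, 9))       # placeholder for the nine other variables

# Groups 1-4: under 90, 90-100, 100-110, over 110 bpm
tempo_group = np.digitize(tempo, bins=[90, 100, 110]) + 1

# Standardization: subtract the mean and divide by the standard deviation
X = (X9 - X9.mean(axis=0)) / X9.std(axis=0)
\end{verbatim}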
\subsection{Hierarchical clustering of the songs}
For the hierarchical clustering of the data set, it was decided to use complete linkage and to measure distance by the Euclidean distance.
The data is too high-dimensional for theoretical considerations of cluster shapes before working with the data, so the linkage and distance measure were chosen based on initial tests, which showed that other choices resulted in multiple singleton groups. This might be due to outliers in the data set, which is checked later in the report. The resulting clusters are plotted in a dendrogram; a sketch of the computation is given below the figure.
\begin{figure}[H]
\centering
\includegraphics[width = 0.8\linewidth]{dendogram}
\caption{Dendrogram of the 2017 songs with complete linkage. Four groups are noted high up in the hierarchy, one of them including most of the data.}
\end{figure}
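\noindent The clustering above could be produced along the following lines; this is a minimal sketch using SciPy, where the placeholder matrix \texttt{X} stands in for the standardized song data and the cut into four groups mirrors the discussion below.
\begin{verbatim}
# Sketch: complete-linkage hierarchical clustering with Euclidean distance
# (X is a placeholder for the 2017 x 9 standardized song matrix).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((2017, 9))          # placeholder data

Z = linkage(X, method="complete", metric="euclidean")

plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode="level", p=5)   # truncated view of the tree
plt.tight_layout()
plt.show()

# Cut the tree into the four major groups discussed in the text
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(labels)[1:])              # sizes of the four clusters
\end{verbatim}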
\noindent From the figure, it is obvious that the data is split into four major clusters until the last few connections of the dendrogram. To understand these four groups and get an idea of whether they are somewhat divided by tempo (a rigorous test of this is done in 1.3), examples from each group and the mean tempo group (1 being low tempo, 4 being high tempo) of the songs are shown:
\begin{itemize}
\item Green group: 17 songs. Mean tempo group: \(2.7\)\\
To interpret this optimal GMM with \(K=9\), the nine cluster centers \(\mu_{1..9}\)
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck23}
\caption{The data and the clusters plotted on the second and third principal components. In the third principal component, clusters 6 and 8 correspond somewhat to the principal component directions.}
\end{figure}\noindent
From the first plot, it is observed that most of the data clusters in a raisin-bun pattern in the middle, showing that many of the clusters capture differences that are too small to be visible in these two principal components. It is seen, though, that the fourth cluster corresponds to high values of the first principal component, which in the first report was linked to high acousticness, low energy and minor key. A direction upwards in the second principal component can also be linked with the third cluster and with songs in the fourth class, corresponding to high energy.
The second plot, using the second and third principal components, is intended to provide more information but is also very grouped around \((0,0)\). The sixth cluster can, though, be linked to low values of the third principal component, which was linked with short, loud, high-energy songs with little speech. This cluster seems to correspond somewhat well to the fourth class: the songs with the highest tempo.
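\noindent The cluster assignments and principal component projections discussed above could be obtained roughly as follows; this is a sketch assuming scikit-learn, with \texttt{X} again a placeholder for the standardized data, nine mixture components as stated earlier, and an illustrative choice of covariance structure.
\begin{verbatim}
# Sketch: fit a 9-component GMM and project songs and cluster centres
# onto principal components (X is placeholder data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.standard_normal((2017, 9))

gmm = GaussianMixture(n_components=9, covariance_type="full",
                      random_state=0).fit(X)
clusters = gmm.predict(X)

pca = PCA(n_components=3).fit(X)
P = pca.transform(X)            # songs in principal component space
C = pca.transform(gmm.means_)   # cluster centres in the same space

# Second vs. third principal component, coloured by GMM cluster
plt.scatter(P[:, 1], P[:, 2], c=clusters, s=5, cmap="tab10")
plt.scatter(C[:, 1], C[:, 2], c="black", marker="x")
plt.xlabel("PC 2")
plt.ylabel("PC 3")
plt.show()
\end{verbatim}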
\subsection{Evaluation of clusterings: Are tempo groups found?}
%\textit{Evaluate the quality of the clusterings using GMM label information and for hierarchical clustering with the same number of clusters as in the GMM.}
To evaluate whether the clusterings are similar to the premade grouping based on the tempo attribute, three different similarity measures are used: the Rand index, the Jaccard index and NMI. The Rand index will typically be very high when there are many clusters; intuitively, most pairs of observations then lie in different clusters rather than in the same cluster in both clusterings, and all such pairs count as agreements, pushing the Rand index close to one. Therefore the Jaccard index, which disregards the pairs of observations placed in different clusters, is also used as a similarity measure. The third measure is the normalized mutual information.
This similarity score has a more theoretical background in information theory, as it is based on quantifying the amount of information one clustering provides about the other. A sketch of how the three scores can be computed is given below, and the evaluation of the GMM and the hierarchical clustering is illustrated in the following table.
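\noindent Assuming the cluster assignments are available as integer label vectors, the three similarity scores could be computed along these lines; the label vectors below are illustrative stand-ins, and the pair-counting Jaccard score is computed from the pair confusion matrix.
\begin{verbatim}
# Sketch: Rand index, pair-counting Jaccard index and NMI between two
# clusterings (the label vectors are placeholders).
import numpy as np
from sklearn.metrics import rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix

rng = np.random.default_rng(0)
tempo_groups = rng.integers(1, 5, size=2017)  # the four premade tempo groups
gmm_labels = rng.integers(0, 9, size=2017)    # e.g. GMM cluster assignments

def jaccard_similarity(a, b):
    # Pairs clustered together in both clusterings divided by pairs
    # clustered together in at least one of them.
    C = pair_confusion_matrix(a, b)
    return C[1, 1] / (C[1, 1] + C[0, 1] + C[1, 0])

print("Rand:   ", rand_score(tempo_groups, gmm_labels))
print("Jaccard:", jaccard_similarity(tempo_groups, gmm_labels))
print("NMI:    ", normalized_mutual_info_score(tempo_groups, gmm_labels))
\end{verbatim}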
\begin{table}[H]
\centering
\begin{tabular}{l l r l r }
\subsection{Ranking songs by typicality}
%\textit{Rank the observations in terms of leave-one-out Gaussian Kernel Density, KNN Density %and KNN Average Relative Density}
\paragraph{Kernel density estimation} is a way to approximate the probability density function of a random variable in a non-parametric way. In the case of the Spotify data set, the resulting density is effectively a Gaussian mixture with one nine-dimensional component per song, each with the diagonal covariance matrix \(\sigma^{2}\mathbf{I}\), since the data set has nine features.
The leave-one-out Gaussian Kernel Density estimation is calculated with the following expression,
\begin{equation}\label{key}
p_{\backslash i}(\mathbf{x}_i)=\frac{1}{N-1} \sum_{n \neq i} \mathcal{N}\left(\mathbf{x}_i \,|\, \mathbf{x}_{n}, \sigma^{2} \mathbf{I}\right)
\end{equation}
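\noindent A minimal sketch of this computation is given below; the bandwidth \(\sigma\) and the placeholder data matrix are assumptions made purely for illustration.
\begin{verbatim}
# Sketch: leave-one-out Gaussian kernel density for every song
# (X is a placeholder for the standardized data, sigma an assumed bandwidth).
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.standard_normal((2017, 9))
sigma = 1.0

N, M = X.shape
sq_dists = cdist(X, X, metric="sqeuclidean")   # pairwise squared distances
kernel = np.exp(-sq_dists / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (M / 2)
np.fill_diagonal(kernel, 0.0)                  # leave the song itself out
densities = kernel.sum(axis=1) / (N - 1)

print(np.argsort(densities)[:10])              # the ten least typical songs
\end{verbatim}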
If the KNN density score of a specific song, the inverse of the mean distance to its nearest neighbours, is low, the song is more likely to be an outlier. Therefore, the songs with the lowest KNN density estimates are illustrated in figure \ref{invest}.
\begin{figure}[H]
\centering
\includegraphics[width=.77\linewidth]{out_KNNdes} \caption{The ten songs with the lowest density under the KNN model. The first song, "The Nearness of You", seems like a possible outlier.}
\label{invest}
\end{figure}
\noindent
\paragraph{The relative density estimation} also takes into account the density of the other songs in a certain song's vicinity. The same KNN model, with $ K=50 $, is fitted to the data. The relative density can then be calculated with the following expression,
\[ \mathrm{ard}_{\mathbf X} (\mathbf x_i, K) = \frac{\mathrm{density}_{\mathbf X_{\backslash i}} (\mathbf x_i, K)}{\frac{1}{K} \sum_{\mathbf{x}_j\in N_{\mathbf X \backslash i}(\mathbf x_i, K)}\mathrm{density}_{\mathbf X_{\backslash j}}(\mathbf x_j, K)} \]
If $ \mathrm{ard} < 1 $, the specific song can be seen as a candidate outlier. The songs with the lowest relative density scores are illustrated in figure \ref{relden}; a sketch of the density and ARD computations is given below.
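\noindent Assuming a standardized data matrix and the $ K=50 $ stated above, the KNN density and the average relative density could be computed roughly as follows; the placeholder data and variable names are illustrative.
\begin{verbatim}
# Sketch: KNN density and average relative density (ARD) with K = 50
# (X is a placeholder for the standardized song matrix).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((2017, 9))
K = 50

# Query K+1 neighbours so the song itself (distance 0) can be discarded.
dist, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]

density = 1.0 / dist.mean(axis=1)           # KNN density of each song
ard = density / density[idx].mean(axis=1)   # density relative to the neighbours

print("Lowest KNN densities:", np.argsort(density)[:10])
print("Lowest ARD scores:   ", np.argsort(ard)[:10])
\end{verbatim}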
\begin{figure}[H]
\centering
\subsection{Are there any outliers?}
From the illustrations in the former exercise of ranking observations, it is evident that the leave-one-out Gaussian kernel density finds a number of songs which seem like outliers on this log scale, including "Southern Man", "The Nearness of You", "Music is the Answer", "Willing and Able", "Loner" and possibly "Oldie". It seems that the model marks all ten plotted songs as candidate outliers, as they all take on an estimated density of \(7.9\ctp{-5}\), while the mean estimated density over all the songs is found to be \(3.3\ctp{-3}\).
Regarding the illustration of K-Nearest Neighbor (KNN) density, "The Nearness of You", "Southern Man", "Willing and Able", "Music is the Answer" and "Viola Sonata" are the most probable outliers. Meanwhile, the Average Relative Density (ARD) has been used to find what seem like only three clear outlier candidates: "Mask Off", "Redbone" and "Master Of None".
To find the most probable outliers in the data set, the best course of action is to compare the results of the three methods covered above. This comparison shows that the Gaussian kernel density and the KNN density have "The Nearness of You" and "Southern Man" in common among their most probable outliers. "Mask Off" could also be an outlier, as this observation has the lowest density in the ARD, although it is not noted as such by any of the other methods.
It is noted that the results from the ARD differ significantly from the results of the other two methods, indicating that taking the relative density into account makes a significant difference. This can be due to a noisy data set with very inhomogeneous data density. It could also be a general result for high-dimensional data, where data points in general are quite far from each other.
\\\\
When looking at outliers in a probabilistic sense, some of these candidates can be seen as outliers. However, if outliers are understood as coming from erroneous data or from another data distribution, as in Hawkins' definition, it cannot be concluded from this exercise that the data contains any outliers: no song consistently stands out dramatically, and the found candidates could just be edge cases from the same distribution, as every probability distribution is expected to have some kind of tail.
\section{Association Mining: What goes together in a song?}
In this part of the report, the data has to be binarized in order to use the mathematical framework of itemsets and transactions to examine the data and find associations between observations. Since the data set is full of continuous variables, it has been one-out-of-K encoded: three intervals are chosen for each variable, with thresholds at the \(\frac13\) and \(\frac23\) quantiles, thus dividing each variable into three binary attributes indicating whether it is low, medium or high.
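\noindent A minimal sketch of this binarization is shown below, assuming the song data is available as a pandas DataFrame; the column names are illustrative.
\begin{verbatim}
# Sketch: one-out-of-K binarization into low/medium/high, split at the
# 1/3 and 2/3 quantiles (df is a placeholder DataFrame of song attributes).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((2017, 3)),
                  columns=["energy", "loudness", "acousticness"])

binarized = pd.DataFrame(index=df.index)
for col in df.columns:
    bins = pd.qcut(df[col], q=[0, 1/3, 2/3, 1],
                   labels=["low", "medium", "high"])
    binarized = binarized.join(pd.get_dummies(bins, prefix=col))

print(binarized.head())   # columns such as energy_low, energy_medium, ...
\end{verbatim}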
\subsection{Apriori Algorithm for Frequent Itemsets and Association Rules}
In this section of the report, the results of running the Apriori algorithm on the data are explored. To run the algorithm, a minimum support of 0.11 and a minimum confidence of 0.6 have been chosen after some exploration of the parameters. Slightly higher thresholds for the support yielded very few results, while slightly lower thresholds for the confidence made the number of associations explode. The results of the algorithm are shown in the table below, and a sketch of the procedure is given after the table.
\caption{The nine found association rules with the above stated minimum confidence and support. The table has been divided for readability.}
\end{table}
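\noindent The rules above could be mined roughly as follows; this is a sketch assuming the mlxtend implementation of Apriori and the binarized DataFrame from the earlier sketch, with the report's thresholds of 0.11 and 0.6.
\begin{verbatim}
# Sketch: frequent itemsets and association rules with minimum support 0.11
# and minimum confidence 0.6 (binarized is the one-out-of-K DataFrame above).
from mlxtend.frequent_patterns import apriori, association_rules

frequent = apriori(binarized.astype(bool), min_support=0.11, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

print(rules[["antecedents", "consequents", "support", "confidence"]]
      .sort_values("confidence", ascending=False))
\end{verbatim}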
\noindent All of the association rules seem to have a very low support, which could have to do with the fact that the data contains both a huge number of attributes, 30 after binarization, and a very large number of possible combinations of these attributes: \(2^{30} = 1{,}073{,}741{,}824\) to be exact. The confidence, on the other hand, seems to be relatively high, which indicates that the association rules have some merit behind them.
\subsection{Do the rules make sense?}
Analysing the association rules, it seems that three distinct sets of rules are found:
\begin{itemize}
\item The first part of the table communicates an equivalence relationship between low energy and low loudness \textit{and} between high energy and high loudness, which is very logical, as one would expect humans to be likely to classify a loud song as energetic.
\item The next section of the table tells part of the same story but also includes acousticness, connecting high acousticness to low loudness and low energy.
\item The third section connects valence, the measure of how positive a song is, to energy and loudness. Low valence together with low energy often results in low loudness, and also the other way around: low loudness and low valence result in low energy.
\end{itemize}
Generally, the Apriori algorithm has found a pattern of quiet, low-energy songs tending to have a high amount of acousticness, while energetic songs often are electronic and loud. This relationship can be recognized from general experience, with an example from the data set being "Girlfriend" by Avril Lavigne, which is a very energetic and loud song and thus, as the algorithm predicts, is not a very acoustic song.
\end{document}