Commit 89ec06d2 authored by sorenmulli

im stuff

parent 4d29be70
......@@ -26,9 +26,9 @@
%\thispagestyle{fancy}
%\tableofcontents
\section{Clustering}
\section{Clustering: Are songs grouped by their tempo?}
\subsection{Hierarchical Clustering}
\subsection{Hierarchical clustering of the songs}
\textit{Interpret the suitable dissimilarity and distance measure for a hierarchical clustering of our dataset. Interpret the results of the clustering.} \\
For the dataset, we have decided to use complete linkage and Euclidean distance, as this combination turned out to give the best separation into distinct clusters.
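As a brief sketch of what this choice means, the standard definition of complete linkage with Euclidean distance can be written as follows. The symbols \(A\), \(B\), \(x\) and \(y\) are introduced here for illustration only and are not used elsewhere in the report.
\[
d(A, B) = \max_{x \in A,\; y \in B} \lVert x - y \rVert_2
\]
Here \(A\) and \(B\) are two clusters of songs and \(x\), \(y\) are attribute vectors of individual songs. In each step, the two clusters with the smallest \(d(A,B)\) are merged, so a merge is only cheap when even the two most distant songs of the clusters are close to each other.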
......@@ -49,17 +49,23 @@ For the dataset, we have decided to use complete linkage and euclidian distance,
\item Purple group: 1800 songs. Mean tempo group: \(3.3\)\\
Examples: \textit{Redbone, Master of None, Parallel Lines, Sneakin'}
\end{itemize}
From this superficial comparison, it seems that there might be a slight connection between the last two groups and songs with higher tempo. From the song names, it is hard to see genre separations, as we do not have access to a genre attribute in the dataset. It might seem that music close to a rock genre is prevalent in the first group, while the second group could be linked to instrumental, slower classical music. The third group can be linked to rap, with several of the songs above by the artist \textit{Future}, while the fourth could be seen as pop music or just music which is not contained in the other groups.
From this superficial comparison, it seems that there might be a slight connection between the last two groups and songs with higher tempo. From the song names, it is hard to see genre separations, as we do not have access to a genre attribute in the dataset. The second group could be linked to instrumental, slower classical music, and the third group can be linked to rap, with several of the songs above by the artist \textit{Future}. The fourth could be seen as pop music or just music which is not contained in the other, more specialized groups. The first group seems very mixed and is hard to interpret.
\subsection{GMM and Component Estimation}
\subsection{Gaussian Mixture Model: Number and meaning of components}
\textit{Cluster the data by the Gaussian Mixture Model and find the number of clusters by cross-validation. Interpret the cluster centers.} \\
In this section, a clustering is performed using the Gaussian Mixture Model, and cross-validation is used to find the appropriate number of clusters for this method. The negative log likelihood as a function of the number of clusters is shown in the illustration below.
The more complex Gaussian Mixture Model is used to cluster the data and estimate the density of the songs in the 9-dimensional song attribute space.
As opposed to the hierarchical model, the GMM has a free, complexity-controlling parameter: the number of components \(K\), i.e. the number of multivariate normal distributions which are used to estimate the data density.
To estimate the optimal value of \(K\), and thereby the complexity of the model, one-level \(10\)-fold cross-validation is run. To achieve reproducibility, \texttt{random\_state} is held constant for both cross-validation and GMM initialization. For all models, the GMM is trained using the EM algorithm with a full covariance matrix, initialized by the K-means algorithm with the initialization repeated three times. Values of \(K\in \{1,2,\ldots,11\}\) are tested, and the negative log likelihood of the data under the model is used as the error of the model.
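For reference, the error measure described above can be sketched in the standard GMM notation. The mixture weights \(w_k\), the covariance matrices \(\Sigma_k\) and the test set \(\mathcal D^{\text{test}}\) are symbols introduced here for illustration; how the errors of the ten folds are combined (summed or averaged) is not stated in the text and is left open.
\[
p(x) = \sum_{k=1}^{K} w_k \, \mathcal N\!\left(x \mid \mu_k, \Sigma_k\right),
\qquad w_k \geq 0, \quad \sum_{k=1}^{K} w_k = 1
\]
\[
-\log \mathcal L = -\sum_{x_i \in \mathcal D^{\text{test}}} \log p(x_i)
\]
The parameters \(w_k\), \(\mu_k\) and \(\Sigma_k\) are fitted on the training part of a fold only, and the negative log likelihood is then evaluated on the held-out songs, so a model that fits noise in the training data is penalized by a low density on the test data.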
\begin{figure}[H]
\centering
\includegraphics[width = 0.6\linewidth]{gmmloss}
\end{figure}\noindent
\noindent From the illustration, it can be concluded that the most appropriate number of components is 9, where \(-\log \mathcal L = 12{,}178\). The negative log likelihood increases when the number of components is either increased (a slight rise) or decreased from 9, which can be understood as over- and underfitting, respectively.
\\
\\
To interpret this optimal GMM with \(K=9\), the nine cluster centers \(\mu_{1},\ldots,\mu_{9}\) are extracted. As it is difficult to visualize this nine-dimensional data and tedious to tabulate all nine nine-dimensional cluster centers, the data, their tempo classes, their GMM clusters and the nine GMM centroids are instead shown in the figure below.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck12}
......@@ -75,7 +81,7 @@ In this section, a clustering is performed using the Gaussian Mixture Model and
\noindent From the illustration, it is obvious that the most appropriate number of clusters is 9. Meanwhile, the negative log likelihood increases when the number of clusters is either increased or decreased from 9, which is to be expected.
\subsection{Evaluation of GMM and Hierarchical Clustering}
\subsection{Evaluation of clusterings: Are tempo groups found?}
%\textit{Evaluate the quality of the clusterings using GMM label information and for hierarchical clustering with the same number of clusters as in the GMM.}
To evaluate whether the clusterings are similar to the predefined grouping given by the tempo attribute, three different similarity measures are used: the Rand index, the Jaccard index and the normalized mutual information (NMI). The Rand index will typically be very high if there are many clusters. Intuitively, this is because far more pairs of observations are placed in different clusters than in the same cluster, and all these pairs count as agreements, which results in a Rand index close to one. Therefore, the Jaccard index is also used, as it disregards the pairs of observations that are placed in different clusters by both clusterings. The third measure, the normalized mutual information, serves the same purpose as the Jaccard and Rand indices but has a more theoretical background in information theory: it quantifies how much information one clustering provides about the other. The evaluation of the GMM and the hierarchical clustering is shown in the following table.
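For reference, the three measures can be sketched with the usual pair-counting definitions. The symbols below are introduced here for illustration: \(N\) is the number of songs, \(S\) is the number of pairs of songs placed in the same cluster by both clusterings, and \(D\) is the number of pairs placed in different clusters by both clusterings.
\[
R = \frac{S + D}{\tfrac{1}{2}N(N-1)},
\qquad
J = \frac{S}{\tfrac{1}{2}N(N-1) - D}
\]
With many clusters, \(D\) dominates the total number of pairs, which pushes \(R\) towards one, while \(J\) removes \(D\) from both numerator and denominator. For the normalized mutual information, let \(Z\) and \(Q\) denote the two clusterings and \(H\) the entropy of the cluster proportions. Several normalizations are in use; the one below, with the geometric mean of the entropies, is an assumption and may differ from the one used to produce the table.
\[
\mathrm{MI}(Z, Q) = H(Z) + H(Q) - H(Z, Q),
\qquad
\mathrm{NMI}(Z, Q) = \frac{\mathrm{MI}(Z, Q)}{\sqrt{H(Z)\,H(Q)}}
\]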