\item Purple group: 1800 songs. Mean tempo group: \(3.3\)\\
Examples: \textit{Redbone, Master of None, Parallel Lines, Sneakin'}
\end{itemize}
From this superficial comparison, it seems that there might be a slight connection between the last two groups and songs with higher tempo. From the song names, it is hard to see genre separations as we do not have access to a genre attribute in the data set. The second group could be linked to instrumental, slower classical music, while the third group can be linked to rap, with several of the above songs by the artist \textit{Future}. The fourth could be seen as pop music, or simply music which is not contained in the other, more specialized groups. The first group seems very mixed and is hard to interpret.
\subsection{Gaussian Mixture Model: Number and meaning of components}
The more complex Gaussian Mixture Model is used to cluster the data and estimate the density of the songs in the 9-dimensional song attribute space.
As opposed to the hierarchical model, the GMM has a free, complexity-controlling parameter: the number of components corresponding to the number of multivariate normal distributions used to estimate the data density.
To estimate the optimal value of \(K\) and the complexity of the model, one-level \(10\)-fold cross-validation is run. To achieve reproducibility, \texttt{random\_state} is held constant for both the cross-validation and the GMM initialization. For all models, the GMM is trained using the EM algorithm with a full covariance matrix, initialized with the K-means algorithm and repeating the initialization three times. Values of \(K\in\{1,2,\dots,11\}\) are tested, and the negative log likelihood of the data under the model is used as the error of the model.
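A minimal sketch of this procedure is shown below, assuming the standardized nine-dimensional song-attribute matrix is available as \texttt{X}; the variable name and the exact \texttt{random\_state} value are illustrative, not taken from our code.
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

K_range = range(1, 12)  # K in {1, 2, ..., 11}
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mean_neg_log_lik = []

for K in K_range:
    fold_errors = []
    for train_idx, test_idx in cv.split(X):
        gmm = GaussianMixture(n_components=K, covariance_type='full',
                              init_params='kmeans', n_init=3,
                              random_state=0)
        gmm.fit(X[train_idx])
        # score() is the mean log likelihood per sample; negate and
        # rescale to get the total negative log likelihood of the fold.
        fold_errors.append(-gmm.score(X[test_idx]) * len(test_idx))
    mean_neg_log_lik.append(np.mean(fold_errors))

best_K = K_range[int(np.argmin(mean_neg_log_lik))]
\end{verbatim}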
\begin{figure}[H]
\centering
\includegraphics[width = 0.6\linewidth]{gmmloss}
...
...
\noindent From the results, it can be concluded that the most appropriate number of components is 9, where \(-\log\mathcal L =12{,}178\). The negative log likelihood increases when the number of components is either increased (slightly) or decreased from 9, which can be understood as overfitting and underfitting respectively.
\\
\\
To interpret this optimal GMM with \(K=9\), the nine cluster centers \(\mu_1,\dots,\mu_9\) are extracted. It is difficult to visualize this nine-dimensional data and tedious to tabulate all nine nine-dimensional cluster centers, so the data, their tempo classes, their GMM clusters and the nine centroids are transformed into a principal component space.
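A sketch of this transformation, assuming \texttt{X} and the fitted \texttt{gmm} from the cross-validation sketch above:
\begin{verbatim}
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
Z = pca.fit_transform(X)             # songs in principal component space
centers = pca.transform(gmm.means_)  # the nine centroids, projected
labels = gmm.predict(X)              # GMM cluster assignment per song
\end{verbatim}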
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck12}
\caption{The data and the clusters plotted on the first two principal components. Most of the data is clustered around (0,0) but cluster 4 and cluster 3 are discernible.}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck23}
\caption{The data and the clusters plotted on the second and third principal components. In the third principal component, cluster 6 and cluster 8 correspond somewhat to principal component directions.}
\end{figure}\noindent
From the first plot, it is observed that most of the data clusters around the origin in a ``raisin bun'' pattern, showing that many of the clusters capture differences that are too small to be visible in these two principal components. It is seen, though, that the fourth cluster corresponds to high values of the first principal component, which in the first report was linked to high acousticness, low energy and a minor key. An upward direction in the second principal component can also be linked with the third cluster and with songs in the fourth class, corresponding to high energy.
The second plot, using the second and third principal components, provides additional information but is also heavily grouped around (0,0). The sixth cluster can nonetheless be linked to a low value of the third principal component, which was linked with short, loud, high-energy songs with little speech. This cluster seems to correspond somewhat well to the fourth class: the songs with the highest tempo.
...
...
\label{simtab}
\caption{Similarity scores between the model clusters and the target clusters.}
\end{table}\noindent
To see if the hierarchical clustering is similar to the Gaussian mixture model, the same similarity measures are calculated between the clusterings of the two models. The similarity of the hierarchical clusters and the GMM clusters is measured as:
\begin{align*}
&\mathrm{Rand \ Index}: \quad 0.7241 &&
\mathrm{Jaccard}: \quad 0.1460 &&&
\mathrm{NMI}: \quad 0.2302
\end{align*}
This suggests that the clusters of the GMM and hierarchical models are mutually more similar than they are to the target clusters. Overall, the similarity scores seen in table \ref{simtab} show that the models found do not describe the target clustering of tempo well. This does not necessarily imply that the GMM and hierarchical clustering models do not cluster the data well, but rather that clustering by tempo does not explain the data well.
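These scores can be computed with a short sketch, assuming \texttt{labels\_hier} and \texttt{labels\_gmm} hold the cluster index of each song under the two models (the variable names are illustrative). \texttt{scikit-learn} provides the Rand index and NMI directly, while the Jaccard coefficient for two clusterings is computed from pair counts: pairs of songs grouped together by both models over pairs grouped together by at least one.
\begin{verbatim}
from sklearn.metrics import rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix

rand = rand_score(labels_hier, labels_gmm)
nmi = normalized_mutual_info_score(labels_hier, labels_gmm)

# pair_confusion_matrix counts pairs of points: C[1, 1] ~ together in
# both clusterings, C[1, 0] / C[0, 1] ~ together in exactly one.
C = pair_confusion_matrix(labels_hier, labels_gmm)
jaccard = C[1, 1] / (C[1, 1] + C[1, 0] + C[0, 1])
\end{verbatim}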
...
...
\end{figure}\noindent
\paragraph{K neighbours density estimation} detects which objects deviate from normal behavior. Firstly, a KNN model with \(K\) neighbours is fitted to the data. Then, the inverse distance density estimate is calculated through the following expression,
\[
\mathrm{density}(\mathbf{x}_i, K) = \left( \frac{1}{K} \sum_{\mathbf{x}' \in N_{\mathbf{X}\backslash i}(\mathbf{x}_i, K)} d(\mathbf{x}_i, \mathbf{x}') \right)^{-1}
\]
where \(\mathbf{x}' \in N_{\mathbf{X}\backslash i}(\mathbf{x}_i, K)\) are the \(K\) nearest observations to \(\mathbf{x}_i\), excluding \(\mathbf{x}_i\) itself. \(K=50\) is chosen somewhat arbitrarily, as no clear way to evaluate the model is available; in the previous report, the KNN \textit{classifier} was found to be optimal via cross-validation at \(K=50\).
If the inverse distance density score of a specific song is low, it is more likely to be an outlier. Therefore, the songs with the lowest inverse distance density estimates are illustrated in figure \ref{invest}.
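A minimal sketch of this estimate, again assuming the data matrix \texttt{X}, with \(K=50\) as above:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

K = 50
# Query K + 1 neighbours so each point's own zero distance can be dropped.
nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
distances, _ = nn.kneighbors(X)
density = 1.0 / distances[:, 1:].mean(axis=1)  # skip self-distance

lowest = np.argsort(density)[:20]  # e.g. the 20 lowest-density songs
\end{verbatim}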