Commit dbc5d5a7 authored by sorenmulli's avatar sorenmulli

workies

parent 5fb8885a
\maketitle
%\thispagestyle{fancy}
%\tableofcontents
\section{The Data: 2017 songs, 10 attributes}
The data set consists of 2017 Spotify songs downloaded from Spotify's API by a Kaggle user\footnote{McIntire, George (04/08-2017). Spotify Song Attributes: \url{https://www.kaggle.com/geomack/spotifyclassification}}. We use ten of the attributes:
\\
\textit{Acousticness,
Danceability,
Duration in ms.,
Energy,
Instrumentalness,
Liveness,
Loudness,
Speechiness,
Tempo,
Valence.
\\}
In the first half of the report, \textit{tempo} is used as a target variable such that only nine variables are considered.
\section{Clustering: Are songs grouped by their tempo?}
Before clustering, tempo is removed from the attribute set, as the first report showed it to explain a relatively high amount of variance; it is therefore suspected to account for some of the clustering structure in the data. Tempo is then thresholded into four groups: below 90 bpm, 90--100 bpm, 100--110 bpm, and above 110 bpm, as sketched below.
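A minimal sketch of this thresholding, assuming the tempo values are available as a NumPy array of bpm values (the example values below are made up for illustration; the report's own \texttt{load\_data} receives the same \texttt{intervals = [90, 100, 110]}):
\begin{verbatim}
import numpy as np

# Hypothetical bpm values for illustration only
tempo = np.array([72.0, 95.3, 104.8, 128.1])

# np.digitize maps each value to one of four groups:
# 0: < 90, 1: 90-100, 2: 100-110, 3: >= 110 (add 1 for groups 1-4)
tempo_group = np.digitize(tempo, bins=[90, 100, 110]) + 1
print(tempo_group)  # [1 2 3 4]
\end{verbatim}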
\subsection{Hierarchical clustering of the songs}
\textit{Interpret the suitable dissimilarity and distance measure for a hierarchical clustering of our dataset. Interpret the results of the clustering.} \\
For the dataset, we have decided to use complete linkage and Euclidean distance.
The data is too high-dimensional for theoretical considerations of cluster shapes before working with it, so the linkage and distance measure were chosen based on initial tests, which showed that other measures resulted in multiple singleton groups. This might be due to outliers in the data set, which is checked later in the report. The resulting clusters are plotted in a dendrogram:
\begin{figure}[H]
\centering
\includegraphics[width = 0.8\linewidth]{dendogram}
\caption{Dendrogram of the 2017 songs with complete linkage. Four groups are noted high up in the hierarchy, one of them including most of the data.}
\end{figure}
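A minimal sketch of how such a clustering can be produced with SciPy; the random matrix merely stands in for the standardized song data, and none of the names below are taken from the report's code:
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.random.randn(2017, 9)  # stand-in for the 9 standardized attributes

# Complete linkage with Euclidean distances between the songs
Z = linkage(X, method='complete', metric='euclidean')

# Cut the tree into four flat clusters, matching the four major groups
labels = fcluster(Z, t=4, criterion='maxclust')

dendrogram(Z, truncate_mode='lastp', p=30)  # show only the last 30 merges
plt.show()
\end{verbatim}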
\noindent From the dendrogram, it is clear that the data is split into four major clusters by the last few connections. To understand these four major groups and get an idea of whether they are somewhat divided by tempo (a rigorous test of this is done in 1.3), examples from each group, together with the mean tempo group of its songs (1 being low tempo, 4 being high tempo), are shown:
\begin{itemize}
\end{itemize}
From this superficial comparison, it seems that there might be a slight connection between the last two groups and songs with higher tempo. From the song names, it is hard to see genre separations, as we don't have access to a genre attribute in the data set. The second group could be linked to instrumental, slower classical music, while the third group can be linked to rap, with several of the above songs by the artist \textit{Future}; the fourth could be seen as pop music, or just music which is not contained in the other, more specialized groups. The first group seems very mixed and is hard to interpret.
\subsection{Gaussian Mixture Model: Number and meaning of components}
\textit{Cluster the data by the Gaussian Mixture Model and find the number of clusters by cross-validation. Interpret the cluster centers.} \\
The more complex Gaussian Mixture Model is used to cluster the data and estimate the density of the songs in the 9-dimensional song attribute space.
As opposed to the hierarchical model, the GMM has a free, complexity-controlling parameter: the number of components corresponding to the number of multivariate normal distributions which are used to estimate the data density.
To estimate the optimal value of \(K\) and thus the complexity of the model, one-level cross-validation is used, with the negative log likelihood of the held-out data as the error measure:
\begin{figure}[H]
\centering
\includegraphics[width = 0.6\linewidth]{gmmloss}
\caption{The negative log likelihood of the data under the GMM with \(K\) components. The model stops improving after \(K=9\), where a soft minimum is reached.}
\end{figure}\noindent
From the results, it can be concluded that the most appropriate number of components is 9, where \(-\log \mathcal L = 12,178\). Meanwhile, the negative log likelihood increases when the number of components is either increased (rising slightly) or decreased from 9, which can be understood as overfitting and underfitting, respectively.
\\
\\
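A sketch of this selection of \(K\) by one-level cross-validation with scikit-learn; the fold count is an assumption, not taken from the report:
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

X = np.random.randn(2017, 9)  # stand-in for the standardized song data

for K in range(1, 12):
    loss = 0.0
    for train, test in KFold(n_splits=10, shuffle=True).split(X):
        gmm = GaussianMixture(n_components=K).fit(X[train])
        # score() is the mean log likelihood per sample on held-out data;
        # negate and accumulate to get the total negative log likelihood
        loss -= gmm.score(X[test]) * len(test)
    print(K, loss)  # choose the K with the smallest loss
\end{verbatim}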
To interpret this optimal GMM with \(K=9\), the nine cluster centers \(\mu_{1..9}\) are extracted. As it is difficult to visualize this nine-dimensional data, and tedious to tabulate all nine nine-dimensional cluster centers, the data, their tempo classes and GMM clusters, and the nine centroids are transformed into a principal component space.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck12}
\caption{The data and the clusters plotted on the first two principal components. Most of the data is clustered around (0,0), but cluster 4 and cluster 3 are discernible.}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{clusterfuck23}
\caption{The data and the clusters plotted on the second and third principal components. Along the third principal component, cluster 6 and cluster 8 correspond somewhat to the component direction.}
\end{figure}
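The projection behind these two plots can be sketched as follows, assuming the standardized data matrix \texttt{X} from before; the use of scikit-learn's \texttt{PCA} here is our own illustration, not the report's code:
\begin{verbatim}
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=9).fit(X)  # refit on all data, K=9

# Fit the principal components on the songs and project both the data
# and the nine cluster centers into the same PC space
pca = PCA(n_components=3).fit(X)
X_pc = pca.transform(X)                  # coordinates used in the plots
centers_pc = pca.transform(gmm.means_)   # the nine centroids
\end{verbatim}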
%[23487.56607812 18127.38559714 14899.35424052 14243.28471954
%13453.47497572 13109.79453539 12855.24139837 12576.10894894
%12178.15906456 12210.34021972 12243.54429977]
\subsection{Evaluation of clusterings: Are tempo groups found?}
%\textit{Evaluate the quality of the clusterings using GMM label information and for hierarchical clustering with the same number of clusters as in the GMM.}
To evaluate whether the clusterings are similar to the premade grouping by the tempo attribute, three different similarity measures are used: the Rand index, the Jaccard index, and the normalized mutual information (NMI). The Rand index will typically be very high when there are many clusters; intuitively, this is because there are many pairs of observations lying in different clusters rather than in the same cluster, which pushes the Rand index close to one. Therefore, the Jaccard index is also used, as it disregards the pairs of observations that are in different clusters in both clusterings. The third measure, NMI, has a more theoretical background in information theory: it quantifies the amount of information one clustering provides about the other. The evaluation of the GMM and hierarchical clustering against the tempo groups is shown in the following table.
\begin{table}[H]
\centering
\begin{tabular}{l l r l r }
\end{tabular}
\end{table} \noindent
To see if the hierarchical clustering is similar to the Gaussian Mixture Model, the same similarity measures are calculated between these two clusterings. The similarity scores of the hierarchical clusters and the GMM clusters are measured as:
\begin{align*}
\mathrm{Rand\ Index}&: \quad 0.7241 &
\mathrm{Jaccard}&: \quad 0.1460 &
\mathrm{NMI}&: \quad 0.2302
\end{align*}
This suggests that the clusters of the GMM and hierarchical models are mutually more similar than they are to the target clusters. Overall, the similarity scores seen in Table \ref{simtab} show that the found clusterings do not describe the target clustering of tempo very well. This does not necessarily imply that the GMM and hierarchical clustering models do not cluster the data well, but rather that a clustering by tempo does not explain the data well.
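A minimal sketch of how the similarity scores above can be computed, assuming two integer cluster-label vectors \texttt{labels\_a} and \texttt{labels\_b} (the pair-counting helper is ours; only NMI is taken directly from scikit-learn):
\begin{verbatim}
import numpy as np
from scipy.special import comb
from sklearn.metrics import normalized_mutual_info_score

def rand_and_jaccard(labels_a, labels_b):
    # Contingency table n[i, j]: songs in cluster i of A and j of B
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    n = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        n[i, j] += 1
    total = comb(len(labels_a), 2)       # all pairs of songs
    S = comb(n, 2).sum()                 # pairs together in both clusterings
    # pairs apart in both clusterings, by inclusion-exclusion
    D = total - comb(n.sum(0), 2).sum() - comb(n.sum(1), 2).sum() + S
    return (S + D) / total, S / (total - D)  # Rand, Jaccard

rand, jaccard = rand_and_jaccard(labels_a, labels_b)
nmi = normalized_mutual_info_score(labels_a, labels_b)
\end{verbatim}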
%Celebratory behavior is therefore more applicable to monkeys than human robots.
%RAND BOI:0.7241369982135971
\[
p(\mathbf{x}) = \sum_{n=1}^{N} \frac{1}{N}\, \mathcal{N}\left(\mathbf{x} \mid \mathbf{x}_n, \lambda^2 \mathbf{I}\right)
\]
\begin{figure}[H]
\centering
\includegraphics[width=.8\linewidth]{out_KDE}
\caption{The Gaussian KDE density scores of the songs; the observations with the lowest density are the outlier candidates.}
\label{GMchart}
\end{figure} \noindent
The K-nearest-neighbor estimate detects objects which deviate from normal behavior. First, a KNN model with \(K\) nearest neighbors is fitted to the data. Then, the inverse distance density estimate is calculated through the following expression,
\[
\mathrm{density}(\mathbf{x}) = \left( \frac{1}{K} \sum_{\mathbf{x}' \in N_K(\mathbf{x})} d(\mathbf{x}, \mathbf{x}') \right)^{-1}
\]
If the inverse density score of a specific song is low, it is more likely to be an outlier.
\begin{figure}[H]
\centering
\includegraphics[width=.8\linewidth]{out_KNNdes}
\caption{The inverse KNN distance density scores of the songs.}
\label{invest}
\end{figure}
\noindent
Another anomaly detection tool is the relative density, which uses the same KNN model fitted to the data with \(K=9\). The average relative density can be calculated with the following expression,
\[
\mathrm{ard}(\mathbf{x}) = \frac{\mathrm{density}(\mathbf{x})}{\frac{1}{K} \sum_{\mathbf{x}' \in N_K(\mathbf{x})} \mathrm{density}(\mathbf{x}')}
\]
If \( \mathrm{ard} < 1 \), the specific song is likely to be an outlier, as its density is lower than the average density of its neighbors.
\begin{figure}[H]
\centering
\includegraphics[width=.8\linewidth]{out_KNNrel}
\caption{The average relative density scores of the songs.}
\label{relden}
\end{figure}
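All three density scores can be sketched with scikit-learn as follows; \(K=9\) follows the report, while the KDE bandwidth and all variable names are our own assumptions:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors, KernelDensity

X = np.random.randn(2017, 9)  # stand-in for the standardized song data
K = 9

# 1) Gaussian kernel density estimate (bandwidth chosen for illustration)
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X)
kde_density = np.exp(kde.score_samples(X))

# 2) Inverse KNN distance density; K+1 because each song is its own
#    nearest neighbor, so the first column of distances is dropped
dist, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
knn_density = 1.0 / dist[:, 1:].mean(axis=1)

# 3) Average relative density: own density over the mean of the neighbors'
ard = knn_density / knn_density[idx[:, 1:]].mean(axis=1)

# The lowest-scoring songs under each measure are the outlier candidates
print(np.argsort(kde_density)[:5], np.argsort(knn_density)[:5])
\end{verbatim}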
\subsection{Three Scoring Methods for Outlier Detection}
if __name__ == "__main__":
    # Load the standardized data with tempo as target, thresholded at 90, 100 and 110 bpm
    X, y, attributeNames, song_names = load_data(standardize=True, target='tempo', intervals=[90, 100, 110])
    print(attributeNames)
    # clusters = clustering1(X)
    clustering2(X)
    # clustering2enhalv(X, y, 9)