Commit bef0f715 authored by sorenmulli

pushed

parent 72132b50
docs/report/tex/Billeder/out_KNNdes.png (image updated, 49.4 KiB → 56.5 KiB)
docs/report/tex/Billeder/out_KNNrel.png (image updated, 50.7 KiB → 47.6 KiB)
@@ -135,53 +135,62 @@ This suggests that the clusters of the GMM and Hierarchical models are mutually mo
\section{Outlier Detection}
To understand whether some songs in this large 2017 data set, comprised of an arbitrary selection of songs, stand out from the others, three methods for outlier detection are implemented.
\subsection{Ranking songs by typicality}
%\textit{Rank the observations in terms of leave-one-out Gaussian Kernel Density, KNN Density %and KNN Average Relative Density}
\paragraph{Kernel density estimation} is a way to approximate the probability density function of a random variable in a non-parametric way. In the case of the Spotify data set, the fitted density corresponds to a mixture of multivariate normal distributions with diagonal covariance matrices, a simplification made due to the number of features in the data set.
The leave-one-out Gaussian Kernel Density estimation is calculated with the following expression,
\begin{equation}\label{key}
p(\mathbf{x})=\sum_{n=1}^{N} \frac{1}{N} \mathcal{N}\left(\mathbf{x} | \mathbf{x}_{n}, \sigma^{2} \mathbf{I}\right)
\end{equation}
The fitted model is then evaluated on the songs in order to calculate their individual density scores. An outlier in this model would then have a low density score, meaning that the probability of the song fitting into any of the clusters made by the GMM is low.
The lowest density score songs are illustrated in figure \ref{GMchart}.
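As a reference, a minimal sketch of how such a leave-one-out kernel density score could be computed is given below. This is an illustration rather than the exact implementation used for the report; the feature matrix \texttt{X} is assumed to be the standardized attribute matrix, and \texttt{sigma} is an assumed kernel width.
\begin{verbatim}
import numpy as np

def loo_kde_scores(X, sigma=1.0):
    # Leave-one-out Gaussian kernel density score for every observation
    N, M = X.shape
    norm = (2 * np.pi * sigma**2) ** (-M / 2)  # normalisation of an isotropic Gaussian
    scores = np.empty(N)
    for i in range(N):
        diff = np.delete(X, i, axis=0) - X[i]  # all other songs relative to song i
        sq_dist = (diff ** 2).sum(axis=1)
        scores[i] = (norm * np.exp(-sq_dist / (2 * sigma**2))).mean()
    return scores

# The ten songs with the lowest scores are the outlier candidates
# candidates = loo_kde_scores(X).argsort()[:10]
\end{verbatim}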
\begin{figure}[H]
\centering
\includegraphics[width=.8\linewidth]{out_KDE}
\caption{The 10 songs with the lowest probability in the KDE. Note that they all have almost exactly the same probability; the y-axis is on a log scale.}
\label{GMchart}
\end{figure} \noindent
\paragraph{K neighbours density estimation} detects which objects deviate from normal behaviour. Firstly, a K-nearest-neighbour model is fitted to the data. Then, the inverse distance density estimation is calculated through the following expression,
\[\mathrm{density }_{\mathbf X_{\backslash i}} (\mathbf x_i, K) = \frac{1}{\frac{1}{K} \sum_{\mathbf{x}^\prime \in N_{\mathbf x \backslash i}(\mathbf x_i, K)}d(\mathbf x_i,\mathbf{x^\prime})} \]
where $\mathbf{x}^\prime \in N_{\mathbf{x} \backslash i}(\mathbf{x}_i, K)$ are the $K$ nearest observations to $\mathbf{x}_i$ other than $\mathbf{x}_i$ itself. \(K=50\) is chosen somewhat arbitrarily, as no clear way to evaluate the model is available; in the previous report, the KNN \textit{classifier} was found via cross-validation to be optimal at \(K=50\).
If the inverse density score of a specific song is low, it is more likely to be an outlier. Therefore, the songs with the lowest inverse distance density estimation are illustrated in figure \ref{invest}.
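A sketch of how this density could be obtained with scikit-learn is shown below, assuming the standardized feature matrix \texttt{X}; note that each query point is returned as its own nearest neighbour and has to be discarded.
\begin{verbatim}
from sklearn.neighbors import NearestNeighbors

K = 50
knn = NearestNeighbors(n_neighbors=K + 1).fit(X)
D, idx = knn.kneighbors(X)
D, idx = D[:, 1:], idx[:, 1:]        # drop the zero self-distance
density = 1.0 / D.mean(axis=1)       # inverse of the mean distance to the K neighbours
candidates = density.argsort()[:10]  # ten songs with the lowest density
\end{verbatim}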
\begin{figure}[H]
\centering
\includegraphics[width=.77\linewidth]{out_KNNdes}
\caption{The ten songs with the lowest density under the KNN model. The first song, ``The Nearness of You'', seems like an outlier.}
\label{invest}
\end{figure}
\noindent
\paragraph{The relative density estimation} also takes into account the density of the other songs in a given song's vicinity. The same KNN model fitted to the data, again with $K=50$, is used. The average relative density is calculated with the following expression,
\[ \mathrm{ard}_{\mathbf X} (\mathbf x_i, K) = \frac{\mathrm{density}_{\mathbf X_{\backslash i}} (\mathbf x_i, K)}{\frac{1}{K} \sum_{\mathbf{x}_j\in N_{\mathbf x \backslash i}(\mathbf x_i, K)}\mathrm{density}_{\mathbf X_{\backslash j}}(\mathbf x_j, K)} \]
If $\mathrm{ard} < 1$, the specific song is likely to be an outlier. The songs with the lowest relative density scores are illustrated in figure \ref{relden}.
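Reusing the hypothetical \texttt{density} and neighbour indices \texttt{idx} from the sketch above, the average relative density could be computed as follows.
\begin{verbatim}
# Density of song i divided by the mean density of its K nearest neighbours
ard = density / density[idx].mean(axis=1)
relative_candidates = ard.argsort()[:10]  # ard well below 1 suggests an outlier
\end{verbatim}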
\begin{figure}[H]
\centering
\includegraphics[width=.77\linewidth]{out_KNNrel}
\caption{The ten songs with the lowest average relative density under the KNN model. These songs are quite different from those found with the absolute density, indicating that the density of the attribute space varies considerably.}
\label{relden}
\end{figure}
\subsection{Are there any outliers?}
From the illustrations in the former exercise of ranking observations, it is evident that the leave-one-out Gaussian Kernel Density finds a number of songs which seem like outliers on this log scale, including ``Southern Man'', ``The Nearness of You'', ``Music is the Answer'', ``Willing and Able'', ``Loner'' and possibly ``Oldie''. In fact, the model marks all ten plotted songs as candidate outliers, as they all take on a probability of \(7.9\ctp{-5}\), while the mean estimated probability over all songs is found to be \(3.3\ctp{-3}\).
Regarding the illustration of K-Nearest Neighbor (KNN) density, ``The Nearness of You'', ``Southern Man'', ``Willing and Able'', ``Music is the Answer'' and ``Viola Sonata'' are the most probable outliers. Meanwhile, the Average Relative Density (ARD) finds what seem like only three possible outliers: ``Mask Off'', ``Redbone'' and ``Master Of None''.
To find the most probable outliers in the data set, the best course of action is to compare the results of the three methods covered above. This comparison shows that ``The Nearness of You'' and ``Southern Man'' are (some of) the more probable outliers, as the Gaussian Kernel Density and the KNN density have these candidates in common. ``Mask Off'' could also be an outlier, as this observation has the lowest density in the ARD, but it is not flagged as such by any of the other methods.
It is noted that the results from the ARD differ significantly from the results from the other two methods, indicating that taking the relative density into account makes a significant difference. This can be due to a noisy data set with a very inhomogeneous data density. It could also be a general result for high-dimensional data, where data points in general are quite far from each other.
\\\\
When looking at outliers in a probabilistic sense, some of these candidates can be seen as outliers. However, if outliers are understood as coming from erroneous data or from another data distribution, as in Hawkins's definition, it cannot be concluded from this exercise that the data contains any outliers: no song consistently stands dramatically out, and the found candidates could just be edge cases from the same distribution, as every probability distribution is expected to have some kind of tail.
\section{Association Mining: What goes together in a song?}
In this part of the report, the data has to be binarized in order to use the mathematical framework of itemsets and transactions to examine the data and find associations between observations. Since the data of this report consists largely of continuous variables, the data has been one-out-of-K encoded: three intervals are chosen for each variable, with limits at the \(\frac13\) and \(\frac23\) quantiles.
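A minimal sketch of such a binarization is given below, assuming the attributes are available in a pandas DataFrame \texttt{df}; the actual \texttt{load\_data\_binarized} helper in the accompanying script may differ in details.
\begin{verbatim}
import pandas as pd

def binarize(df):
    # One-out-of-K encode every continuous column into low/medium/high
    # using the 1/3 and 2/3 quantiles as cut points
    out = {}
    for col in df.columns:
        q1, q2 = df[col].quantile([1/3, 2/3])
        out[col + "_low"] = (df[col] <= q1).astype(int)
        out[col + "_medium"] = ((df[col] > q1) & (df[col] <= q2)).astype(int)
        out[col + "_high"] = (df[col] > q2).astype(int)
    return pd.DataFrame(out)
\end{verbatim}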
\subsection{Apriori Algorithm for Frequent Itemsets and Association Rules}
In this section of the report, the results of running the Apriori algorithm on the data are explored. To run the algorithm, a minimum support of 0.11 and a minimum confidence of 0.6 have been chosen after some exploration of the parameters: slightly higher support thresholds yielded very few results, while slightly lower confidence thresholds made the number of association rules explode. The results of the algorithm are shown in the table below.
\begin{table}[H]
\centering
@@ -198,12 +207,19 @@ In this section of the report, the results of running the Apriori algorithm on t
\{valence\_low, energy\_low\} & $\rightarrow$ & \{loudness\_low\} & 0.112 & 0.708 \\
\{loudness\_low, valence\_low\} & $\rightarrow$ & \{energy\_low\} & 0.112 & 0.833 \\ \bottomrule
\end{tabular}
\caption{The nine association rules found with the above-stated minimum confidence and support. The table has been divided for readability.}
\end{table}
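For reference, rules of this form can be mined with, for instance, the \texttt{mlxtend} implementation of Apriori. The sketch below assumes the binarized attribute matrix \texttt{X\_bin} from the previous section and is not necessarily the implementation used to produce the table above.
\begin{verbatim}
from mlxtend.frequent_patterns import apriori, association_rules

# Frequent itemsets with support >= 0.11, then rules with confidence >= 0.6
itemsets = apriori(X_bin, min_support=0.11, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
\end{verbatim}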
\noindent All of the association rules seem to have a very low support, which could have to do with the fact that the data contains both a large number of attributes (27 after binarization) and an enormous number of possible combinations of these attributes ($2^{27} = 134{,}217{,}728$ to be exact). The confidence, on the other hand, seems to be relatively high, which indicates that the association rules have some merit behind them.
\subsection{Do the rules make sense?}
Analysing the association rules, it seems that three distinct sets of rules are found:
\begin{itemize}
\item The first part of the table communicates an equivalence relationship between low energy and low loudness \textit{and} between high energy and high loudness, which is very logical: one would expect a loud song to also be perceived as energetic.
\item The next section of the table tells part of the same story but also includes acousticness, connecting high acousticness to low loudness and low energy.
\item The third section connects valence, the measure of how positive a song is, to energy and loudness. Low valence together with low energy often results in low loudness, and also the other way around: low loudness together with low valence results in low energy.
\end{itemize}
Generally, the Apriori algorithm has found a pattern of quiet, low-energy songs having a tendency towards high acousticness, while energetic songs often are electronic and loud. This relationship can be recognized from general experience, with an example from the data set being ``Girlfriend'' by Avril Lavigne, which is a very energetic and loud song and thus, as the algorithm predicts, not a very acoustic song.
\end{document}
@@ -154,6 +154,8 @@ def outlier1(X, songnames):
    idx = scores.argsort()
    scores.sort()
    print(scores.mean())
    print(scores.std())
    # Bar chart of the 10 lowest leave-one-out KDE scores (the outlier candidates)
    plt.bar(range(10), scores[:10], log=True)
    print(scores[:10])
@@ -162,7 +164,7 @@ def outlier1(X, songnames):
    # The ten lowest scores are practically identical, so set explicit y-ticks
    plt.yticks([7.90598304e-05, 7.90598320e-05, 7.90598337e-05, 7.90598352e-05])
    plt.show()
    K = 50  # same K as was found optimal for the KNN classifier in the previous report
    knn = NearestNeighbors(n_neighbors=K).fit(X)
    D, i = knn.kneighbors(X)          # distances and indices of the K nearest neighbours
    density = 1./(D.sum(axis=1)/K)    # inverse of the mean distance to the K neighbours
@@ -213,14 +215,14 @@ def association_mining1(X, labels):
if __name__ == "__main__":
    #X, y, attributeNames, song_names = load_data(standardize = True, target = 'tempo', intervals = [90, 100, 110])
    #print(attributeNames)
    #clusters = clustering1(X)
    #clustering2(X)
    #clustering2enhalv(X, y, 9)
    #clustering3(X, y, 9)
    #song_names = list(song_names)
    #outlier1(X, song_names)
    X, attributeNames, song_names = load_data_binarized(3)
    association_mining1(X, attributeNames)
\ No newline at end of file