The leave-one-out Gaussian Kernel Density estimation is calculated with the following expression,
\begin{equation}\label{key}
p_{\mathbf{X}_{\backslash i}}(\mathbf{x}_{i})=\frac{1}{N-1} \sum_{n \neq i} \mathcal{N}\left(\mathbf{x}_{i} | \mathbf{x}_{n}, \sigma^{2} \mathbf{I}\right)
\end{equation}
Kernel density estimation approximates the probability density function of a random variable in a non-parametric way. For the Spotify dataset the fitted GMM consists of multivariate normal distributions, since the dataset contains several features. The fitted GMM is then evaluated on the songs in order to calculate their individual density scores. An outlier in this model has a low density score, meaning the probability that the song fits into any of the clusters found by the GMM is low. The songs with the lowest density scores are shown in the bar chart in Figure \ref{GMchart}.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{out_KDE}
\caption{Songs with the lowest leave-one-out Gaussian kernel density scores.}
\label{GMchart}
\end{figure} \noindent
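As a minimal sketch of how these leave-one-out density scores could be computed, the snippet below assumes the standardised Spotify features are available as a NumPy array \texttt{X} and that a kernel width \texttt{sigma} has already been selected; it is an illustration rather than the exact script used for the report.
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def loo_kde_scores(X, sigma):
    """Leave-one-out Gaussian KDE score for every observation."""
    N, _ = X.shape
    scores = np.empty(N)
    for i in range(N):
        X_rest = np.delete(X, i, axis=0)   # leave song i out
        # average the Gaussian kernels centred on the remaining songs
        scores[i] = multivariate_normal.pdf(X_rest, mean=X[i],
                                            cov=sigma ** 2).mean()
    return scores

# the songs with the lowest scores are the outlier candidates
# candidates = np.argsort(loo_kde_scores(X, sigma))
\end{verbatim}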
The K-nearest-neighbour density estimation detects observations that deviate from normal behaviour. First, a KNN model with $K$ nearest neighbours is fitted to the data. Then the inverse distance density is calculated through the following expression,
\[\mathrm{density}_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K) = \frac{1}{\frac{1}{K} \sum_{\mathbf{x}^\prime \in N_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K)} d(\mathbf{x}_i, \mathbf{x}^\prime)} \]
where $ N_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K) $ denotes the $K$ nearest observations to $ \mathbf{x}_i $, excluding $ \mathbf{x}_i $ itself.
% TODO find a real K in an appropriate way.
The lower the inverse distance density score of a song, the more likely it is to be an outlier. Therefore, the songs with the lowest inverse distance density estimates are shown in Figure \ref{invest}.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{out_KDE}
\caption{Songs with the lowest inverse distance density scores.}
\label{invest}
\end{figure} \noindent
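A corresponding sketch of the inverse distance density, again assuming the features live in a NumPy array \texttt{X} and using scikit-learn's nearest-neighbour search, could look as follows; the helper name \texttt{knn\_density} is only illustrative.
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, K):
    """Inverse of the mean distance to the K nearest neighbours."""
    # ask for K + 1 neighbours, since the closest one is the point itself
    dist, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    return 1.0 / dist[:, 1:].mean(axis=1)   # drop the zero self-distance

# small density values indicate likely outliers
# candidates = np.argsort(knn_density(X, K=9))
\end{verbatim}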
Another anomaly detection tool is the average relative density. The same KNN model fitted to the data with $ K=9 $ is used. The average relative density is calculated with the following expression,
\[ \mathrm{ard}_{\mathbf{X}}(\mathbf{x}_i, K) = \frac{\mathrm{density}_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K)}{\frac{1}{K} \sum_{\mathbf{x}_j \in N_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K)} \mathrm{density}_{\mathbf{X}_{\backslash j}}(\mathbf{x}_j, K)} \]
If $ \mathrm{ard} < 1 $, the song is likely to be an outlier. The songs with the lowest average relative density scores are shown in Figure \ref{relden}.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{out_KNNdes}
\caption{Songs with the lowest average relative density scores.}
\label{relden}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{out_KNNrel}
\end{figure}
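Finally, the average relative density expression above can be sketched in the same style: the densities of the $K$ neighbours are looked up and averaged. The array \texttt{X} and the function name are again illustrative assumptions rather than the report's actual code.
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_avg_relative_density(X, K):
    """density(x_i, K) divided by the mean density of x_i's K neighbours."""
    dist, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    density = 1.0 / dist[:, 1:].mean(axis=1)   # inverse distance density
    neighbour_density = density[idx[:, 1:]]    # densities of the K neighbours
    return density / neighbour_density.mean(axis=1)

# ard < 1 means a song is less dense than its neighbourhood
# candidates = np.argsort(knn_avg_relative_density(X, K=9))
\end{verbatim}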