Commit 61b92c90 authored by Anders Henriksen

Part 3

parent ad2f9f8d
To find the most probable outliers in the dataset, the best course of action is to compare the results of the three methods covered above. This comparison shows that "My Heart Will Go On" and "The Nearness of You" are the most probable outliers, as both the Gaussian Kernel Density and the KNN density rank them among the lowest-density observations. "Redbone" could also be an outlier, as this observation has the lowest density in the ARD but is not flagged as such by any of the other methods.
\[\mathrm{density}_{\mathbf{X}_{\backslash i}} (\mathbf{x}_i, K) = \frac{1}{\frac{1}{K} \sum_{\mathbf{x}_j \in N_{\mathbf{X}_{\backslash i}}(\mathbf{x}_i, K)} d(\mathbf{x}_i, \mathbf{x}_j)} \]
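\noindent As an illustration, the KNN density above can be computed with a few lines of code. The sketch below assumes scikit-learn and a standardized feature matrix \texttt{X}; the function name, the value of $K$, and the choice of library are illustrative and not taken from the report.
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, K):
    # Query K + 1 neighbours, since the nearest neighbour of every point
    # is the point itself (distance 0) and must be left out.
    dist, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    # Density = 1 / (average distance to the K nearest other observations).
    return 1.0 / dist[:, 1:].mean(axis=1)

# densities = knn_density(X, K=5)
# candidate_outliers = np.argsort(densities)[:5]   # lowest-density songs
\end{verbatim}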
\section{Association Mining}
In this part of the report, the data has to be binarized in order to use the Apriori algorithm to find associations between observations. Since the attributes in this dataset are not binary, each variable has been one-out-of-K encoded into three intervals. The intervals for every variable are defined by the 0.33 and 0.66 quantiles as well as the maximum value.
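\noindent A minimal sketch of this binarization, assuming pandas and that the attributes are stored in a DataFrame \texttt{df} (the \texttt{\_q1}/\texttt{\_q2}/\texttt{\_q3} item names mirror those used in the rules later on), could look as follows:
\begin{verbatim}
import pandas as pd

def binarize(df):
    items = {}
    for col in df.columns:
        q1, q2 = df[col].quantile([0.33, 0.66])            # interval boundaries
        items[col + "_q1"] = (df[col] <= q1).astype(int)   # lowest interval
        items[col + "_q2"] = ((df[col] > q1) & (df[col] <= q2)).astype(int)
        items[col + "_q3"] = (df[col] > q2).astype(int)    # up to the max value
    return pd.DataFrame(items)
\end{verbatim}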
\subsection{Apriori Algorithm for Frequent Itemsets and Association Rules}
\textit{Find the frequent itemsets and the association rules with high confidence based on the results of the Apriori algorithm.} \\
In this section of the report, the results of running the Apriori algorithm on the data are explored. To run the algorithm, a minimum support of 0.11 and a minimum confidence of 0.6 were chosen after some exploration of suitable parameter values. The results of the algorithm are shown in the table below.
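\noindent A sketch of this mining step is given below. It assumes the \texttt{mlxtend} implementation of Apriori and the binarized DataFrame \texttt{df\_bin} from the previous section; the report does not state which implementation was used, so the library choice is an assumption.
\begin{verbatim}
from mlxtend.frequent_patterns import apriori, association_rules

itemsets = apriori(df_bin.astype(bool), min_support=0.11, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
rules = rules.sort_values("confidence", ascending=False)
print(rules[["antecedents", "consequents", "support", "confidence"]])
\end{verbatim}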
\begin{table}[H]
    \centering
    \begin{tabular}{lcc}
        \hline
        Association rule & Support & Confidence \\
        \hline
        \{energy\_q1\} $\rightarrow$ \{loudness\_q1\} & 0.226 & 0.674 \\
        \{loudness\_q1\} $\rightarrow$ \{energy\_q1\} & 0.226 & 0.676 \\
        \{energy\_q3\} $\rightarrow$ \{loudness\_q3\} & 0.210 & 0.634 \\
        \{loudness\_q3\} $\rightarrow$ \{energy\_q3\} & 0.210 & 0.631 \\
        \hline
        \{acousticness\_q3, energy\_q1\} $\rightarrow$ \{loudness\_q1\} & 0.142 & 0.740 \\
        \{loudness\_q1, acousticness\_q3\} $\rightarrow$ \{energy\_q1\} & 0.142 & 0.842 \\
        \{loudness\_q1, energy\_q1\} $\rightarrow$ \{acousticness\_q3\} & 0.142 & 0.631 \\
        \hline
        \{valence\_q1, energy\_q1\} $\rightarrow$ \{loudness\_q1\} & 0.112 & 0.708 \\
        \{loudness\_q1, valence\_q1\} $\rightarrow$ \{energy\_q1\} & 0.112 & 0.833 \\
        \hline
    \end{tabular}
\end{table}
\noindent All of the association rules have rather low support, which could be related to the fact that the data contains both a large number of attributes (27 after binarization) and an enormous number of possible itemsets over these attributes ($2^{27} = 134{,}217{,}728$ to be exact). The confidence, on the other hand, is relatively high, which indicates that at least some of the association rules have merit.
\subsection{Interpretation of the Association Rules}
\textit{Interpret the generated association rules.} \\
Analysing the association rules, it is clear that most of them describe an energy-loudness-valence relationship. This is most apparent in the first section of the table, which shows that low energy and low loudness imply one another (the rules hold in both directions), and likewise that high energy and high loudness imply one another. This trend is mirrored in the other two sections, further reinforcing the findings of the algorithm. In general, Apriori has found that quiet, passive songs tend to have high acousticness, while energetic songs have low acousticness and high loudness. This seems reasonable when compared to a song like "Girlfriend" by Avril Lavigne, which is a very energetic and loud song and, as the rules predict, not a very acoustic one.
\end{document}