%% Cell type:markdown id: tags:
# 02456 Deep Learning Exercise 1 Pen and Paper
%% Cell type:markdown id: tags:
# Contents and why we need this lab
The first lab is all about understanding the mathematical concepts behind neural networks. You will derive everything by hand this first week. The hand-in this week is your modifications to this notebook.
In week two you will program what you have derived this week in [NumPy](https://numpy.org/) and in week three and onwards you will work in a framework dedicated to making it easy to write deep learning code. In this course we will use [PyTorch](https://pytorch.org/) because it is more Python-like than for example TensorFlow. However, TensorFlow has some advantages if you want to deploy deep learning models. For learning and research PyTorch is right now the preferred framework. But this is a quite dynamic field where a lot is happening, so it is hard to say what the preferred framework will be a year from now.
Linear algebra, probability theory, statistics and optimization all underpin deep learning. The availability of coding frameworks for machine learning hides this to some degree and makes it possible to build quite impressive applications without a deep understanding of the underlying mathematical concepts. This is both a blessing and a curse. The ambition of this course is to educate first-class deep learners who can develop deep learning solutions tailored to the problem at hand. Therefore we need the fundamental understanding. We have experienced that neglecting the fundamentals will always come back to haunt you later when we encounter real-world problems. So sharpen your pen and get ready to learn new stuff or, more likely, refresh some stuff you forgot you knew years ago. :-)
%% Cell type:markdown id: tags:
# List of contents
1. External sources of information
2. Neural networks - the feed-forward model
3. Loss functions and maximum likelihood
4. Stochastic gradient descent
5. Error backpropagation
%% Cell type:markdown id: tags:
# External sources of information
1. Book. The notation will to a high degree follow that of [Chris Bishop, Pattern recognition and machine learning](https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book). This book is freely available online and is still a very valuable source of information on all things machine learning.
2. Jupyter notebook. You can find more information about Jupyter notebooks [here](https://jupyter.org/). It comes as part of the [Anaconda](https://www.anaconda.com/) Python installation.
3. Markdown. Jupyter notebooks use cells. A cell is executed by pressing the Run tab above. In this lab you will only use Markdown cells. You add a cell by pressing the + tab above or choosing it from the Insert menu, also above. A good overview of Markdown formatting can be found [here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). For equations Markdown uses [LaTeX](https://en.wikibooks.org/wiki/LaTeX/Mathematics).
4. [Mathematics for machine learning](https://mml-book.github.io/book/mml-book.pdf) is a book where the title says it all.
%% Cell type:markdown id: tags:
# Neural networks - the feed forward model
We will meet different variants of neural network models throughout the course. We need different architectures because they have different types of symmetries that reflect the different spatial and temporal invariances of the data we encounter.
To get started we work with the most basic variant: the so-called *feed-forward neural network* (FFNN). A FFNN with one hidden layer with $M$ hidden units and two layers of adaptable parameters (weights) denoted by $w=w^{(1)},w^{(2)}$ is plotted below. In the literature sometimes $\theta$ is used instead of $w$.
%% Cell type:code id: tags:
``` python
# Figure 5.1 from Bishop
from IPython.display import Image
Image("figures/Figure5.1.jpg",width=500)
```
%% Output
<IPython.core.display.Image object>
%% Cell type:markdown id: tags:
We are considering supervised learning where we have both inputs (covariates) and associated outputs (labels, targets, response variables) available. The network has $D$ inputs: $\mathbf{x}=x_1,\ldots,x_D$ and $K$ outputs $\mathbf{y}=y_1,\ldots,y_K$. The algorithm is called a neural network because the architecture of the model resembles the biological network between neurons in our brains.
This computation is layered. Each layer first computes what is equivalent to the output $a_j$ of a linear statistical model and then applies a non-linear function to the linear model output. For the first layer - which takes the network input $x$ as input - these two steps look like this
$$
\begin{align}
a^{(1)}_j & = \sum_{i=1}^D w^{(1)}_{ji} x_i + w^{(1)}_{j0} \\
z^{(1)}_j & = h_1(a^{(1)}_j ) \ ,
\end{align}
$$
where $h_1$ is the non-linear function in the first layer. We can avoid writing the so-called bias $w^{(1)}_{j0}$ explicitly by adding an extra input $x_0$ which is always set to one and extending the sum to go from zero:
$$
\begin{align}
a^{(1)}_j & = \sum_{i=0}^D w^{(1)}_{ji} x_i \ .
\end{align}
$$
We will use this notation in the following for all layers and just write for example $\sum_i \ldots$ to be understood as $\sum_{i=0}^D \ldots$.
The second layer takes the output of the first layer as input:
$$
a^{(2)}_j = \sum_{i=0}^M w^{(2)}_{ji} z^{(1)}_i \ .
$$
The second layer non-linear function is denoted by $h_2$ so the output of the network is
$$
y_j = h_2(a^{(2)}_j)
$$
This gives an example of how the neural network model input to output mapping can be specified.
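To make the notation concrete, here is a minimal NumPy sketch of the two-layer forward pass. It is only an illustration: the dimensions are arbitrary, and the choice of $\tanh$ for $h_1$ and the identity for $h_2$ are assumptions, not part of the definition above.
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                     # input, hidden and output dimensions
x = rng.normal(size=D)

# Absorb the bias by appending the constant input x_0 = 1.
x_tilde = np.concatenate(([1.0], x))  # shape (D+1,)
W1 = rng.normal(size=(M, D + 1))      # weights w^(1)_{ji}
W2 = rng.normal(size=(K, M))          # weights w^(2)_{ji} (second-layer bias omitted for brevity)

a1 = W1 @ x_tilde                     # a^(1)_j = sum_{i=0}^D w^(1)_{ji} x_i
z1 = np.tanh(a1)                      # z^(1)_j = h_1(a^(1)_j)
a2 = W2 @ z1                          # a^(2)_j = sum_i w^(2)_{ji} z^(1)_i
y = a2                                # h_2 = identity, a common choice for regression
print(y)
```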
%% Cell type:markdown id: tags:
Let us do a few exercises where you complete the equations by replacing $\ldots$ with the correct expressions.
## Exercise a)
a) Write $y_j$ directly as a function of $x$. That is, eliminate the $a$s and $z$s:
$$
y_j = h_2(\ldots)
$$
%% Cell type:markdown id: tags:
## Answer a)
$$
y_j = h_2(a_j^{(2)})\\
= h_2({\sum_{i=0}^M w^{(2)}_{ji}z_i^{(1)}})\\
= h_2({\sum_{i=0}^M w^{(2)}_{ji}h_1(a_i^{(1)})})\\
= h_2({\sum_{i=0}^M w^{(2)}_{ji}h_1(\sum_{k=0}^D w_{ik}^{(1)}x_k)})
$$
%% Cell type:markdown id: tags:
## Exercise b)
b) Write the equation for a neural network with two hidden layers and three layers of weights $w=w^{(1)},w^{(2)},w^{(3)}$. Again, without using $a$s and $z$s.
$$
y_j = h_3(\ldots)
$$
%% Cell type:markdown id: tags:
## Answer b)
Let $M'$ denote the number of units in the second hidden layer. Then
$$
y_j = h_3(\sum_{i=0}^{M'} w^{(3)}_{ji}z_i^{(2)}) \\
= h_3(\sum_{i=0}^{M'} w^{(3)}_{ji}h_2({\sum_{k=0}^M w^{(2)}_{ik}h_1(\sum_{m=0}^D w_{km}^{(1)}x_m)}))
$$
%% Cell type:markdown id: tags:
## Exercise c)
c) Write the equations for an FFNN with $L$ layers as a recursion. Use $l$ as the index for the layer:
$$
\begin{align}
y_j & = \ldots & \\
z^{(l)}_j & = h_l(\ldots) & l=1,\ldots,L \\
a^{(l)}_j & = \ldots & l=2,\ldots,L \\
a^{(1)}_j & = \ldots &
\end{align}
$$
%% Cell type:markdown id: tags:
## Answer c)
$$
\begin{align}
y_j & = z_j^{(L)} & \\
z_j^{(l)} & = h_l(a_j^{(l)}) & l=1,\ldots,L \\
a_j^{(l)} & = \sum_{i} w_{ji}^{(l)} z_i^{(l-1)} & l=2,\ldots,L \\
a^{(1)}_j & = \sum_{i=0}^D w^{(1)}_{ji} x_i &
\end{align}
$$
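The recursion translates directly into a loop. Below is a minimal sketch (my own illustration; $\tanh$ in every layer and the omitted bias handling are assumed simplifications) that also stores the $a$s and $z$s, which we will need for backpropagation later:
%% Cell type:code id: tags:
``` python
import numpy as np

def forward(weights, x, h=np.tanh):
    """Forward pass through an L-layer FFNN.

    weights: list [W1, ..., WL] of weight matrices; x: input vector.
    Returns the network output y and the stored a's and z's.
    """
    zs, activations = [x], []          # zs[0] plays the role of the input
    for W in weights:
        a = W @ zs[-1]                 # a^(l)_j = sum_i w^(l)_{ji} z^(l-1)_i
        activations.append(a)
        zs.append(h(a))                # z^(l)_j = h_l(a^(l)_j)
    return zs[-1], activations, zs     # y_j = z^(L)_j
```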
%% Cell type:markdown id: tags:
## Exercise d) optional
Optional exercises do not give extra credits but are included to give you an opportunity to dive a bit deeper. You can omit them for the first run through the lab and then return to them later on when the other exercises are completed.
d) Do we really need the non-linearities? Show that if we remove the non-linear functions $h_l$ from the expressions above then the output becomes linear in $x$. This means that the model collapses back to the linear model and therefore cannot learn non-linear relations between $x$ and $y$.
%% Cell type:markdown id: tags:
## Answer d)
If we remove the non-linear functions, each layer computes $a^{(l)}_j = \sum_i w^{(l)}_{ji} a^{(l-1)}_i$, i.e. a matrix-vector product. A composition of linear maps is itself a linear map, so the output becomes $y = W^{(L)} \cdots W^{(1)} x = \tilde{W} x$ for a single matrix $\tilde{W}$. The model therefore collapses back to the linear model.
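The collapse is easy to check numerically: with the $h_l$ removed, the layers multiply out to a single matrix. A small sketch (shapes chosen arbitrarily):
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(5, 4))
W3 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Without the h_l, the network computes y = W3 @ (W2 @ (W1 @ x)) ...
y_network = W3 @ (W2 @ (W1 @ x))
# ... which equals a single linear model with weight matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
assert np.allclose(y_network, W @ x)
```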
%% Cell type:markdown id: tags:
# Loss functions and maximum likelihood
The FFNN model is quite flexible and can learn to approximate all sorts of functions from training data. Below are some examples of one-dimensional functions learned with a FFNN - i.e. we try to learn a function (the neural network) from a set of points $(x,y)$ that can predict $y$ from input $x$ with as high accuracy as possible. The FFNN we use below consists of one hidden layer with three hidden units, $\tanh$ as the non-linear function in the hidden layer and linear output:
$$
y(x) = \sum_{i=0}^3 w^{(2)}_i \tanh( w^{(1)}_i x )
$$
The training data is visualised as the blue dots, the learned function $y(x)$ (only displayed in the interval where we have training data) is the solid red line and the dashed lines are the outputs from the hidden units.
A very nice website to get intuition about how neural networks fit to data is the [TensorFlow playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4&seed=0.48021&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=regression&initZero=false&hideText=false). The example shown is two-dimensional regression: the input $x$ is two-dimensional and the output $y$ is one-dimensional and continuous. Blue encodes high values and yellow low values. You can change many things including switching to classification tasks in the top menu to the right and to other more challenging problems to the left.
%% Cell type:code id: tags:
``` python
# Figure 5.3 from Bishop
from IPython.display import display
display(Image("figures/Figure5.3a.jpg",width=400),\
Image("figures/Figure5.3b.jpg",width=400),\
Image("figures/Figure5.3c.jpg",width=400),\
Image("figures/Figure5.3d.jpg",width=400))
```
%% Output
%% Cell type:markdown id: tags:
This gives some insight into how the neural network uses the hidden units to approximate the curve. But we need to formalise learning and fitting to data. Here the concept of a loss/error function and later on maximum likelihood will come in handy.
The training set $\mathcal{D}=\{ (\mathbf{x}_n,\mathbf{t}_n)|n=1,\ldots,N\}$ is comprised of $N$ input-output pairs. In the plots above these are all the blue dots.
We can define a loss or error function $E(w)$ as some measure of how far $y(x)$ is from the training data, that is the difference between the red line and the blue dots. As above, $w$ denotes the set of weights of the FFNN, which are the parameters of the neural network we are trying to learn - i.e. we wish to determine a set of weights that results in a low value of the loss function. This is the training error as it is only computed for the training points. In the literature sometimes $L$ or $J$ is used instead of $E$.
In machine learning we are very much concerned with what the model predicts for inputs which are not in the training set. We often talk about the generalisation error which measures the loss function on data points sampled randomly from the data distribution. In general we do not know the data distribution so instead we set aside a subset of the data we have - the test set - to compute an estimate of the generalisation error. An important skill to have in machine learning is to perform the optimization of the model in such a way that it generalizes well. Much more about that in the coming weeks...
An example of a loss function is the sum of squares:
$$
E(w) = \sum_{n=1}^N || \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 \ ,
$$
where $|| \ldots ||_2^2$ is shorthand for the sum of squared components.
This sum of squares error function is appropriate for regression (= targets are continuous) as in the example above. It is not useful for classification where targets are discrete class labels. So in order to come up with a proper loss function for all cases we will appeal to maximum likelihood. The likelihood is defined as the probability of the data - in supervised learning the targets $\mathbf{t}_1,\ldots,\mathbf{t}_N$ - given the model parameters and the inputs:
$$
p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w) \ .
$$
Maximum likelihood simply says that we should use the set of parameters $w$ that assigns the highest possible probability to the observed targets $\mathbf{t}_1,\ldots,\mathbf{t}_N$.
In order to go further from here we need to make some assumptions:
1. Unless we are working with series data - such as time series - we will assume that each data point is independently sampled and each data point is sampled from the same distribution (also known as i.i.d. = independent and identically distributed):
$$
p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w) =
\prod_{n=1}^N p(\mathbf{t}_n|\mathbf{x}_n,w) \ .
$$
2. To connect with the sum of squares loss we will assume that the target can be written as the model output plus some random error $\mathbf{\epsilon}$ and that $\mathbf{\epsilon}$ has a Gaussian distribution with zero mean and covariance $\sigma^2 \mathbf{I}$. This means that the targets themselves are Gaussian distributed with mean $\mathbf{y}(\mathbf{x})$ and covariance $\sigma^2 \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Denoting the Gaussian distribution by $\mathcal{N}$ we can write this as:
$$
p(\mathbf{t}_n|\mathbf{x}_n,w) = \mathcal{N}(\mathbf{t}_n|\mathbf{y}(\mathbf{x}_n),\sigma^2 \mathbf{I}) \ .
$$
%% Cell type:markdown id: tags:
## Exercise e)
e) In this exercise you will show that with the two above assumptions we can derive a loss function that contains $E(w)$ as a term. Two hints:
1. With the used covariance we can write the Gaussian distribution as
$$
\mathcal{N}(\mathbf{t}_n|\mathbf{y}(\mathbf{x}_n),\sigma^2 \mathbf{I}) = \frac{1}{\sqrt{2\pi \sigma^2}^D}
\exp ( - || \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 /2\sigma^2 )
$$
2. In order to turn maximum likelihood into a loss/error function apply the (natural) logarithm to the likelihood objective and multiply by minus one.
Show that the loss we get is
$$
\frac{ND}{2} \log 2\pi \sigma^2 + \frac{1}{2\sigma^2} E(w) \ .
$$
Further argue why applying the log and multiplying by minus one is the right thing to do in order to get a loss function. *Hint:* Will the optimum of the likelihood function change if we apply the logarithm?
%% Cell type:markdown id: tags:
## Answer e)
Since
$$
p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w) =
\prod_{n=1}^N p(\mathbf{t}_n|\mathbf{x}_n,w) \ ,
$$
$$
\mathcal{N}(\mathbf{t}_n|\mathbf{y}(\mathbf{x}_n),\sigma^2 \mathbf{I}) = \frac{1}{\sqrt{2\pi \sigma^2}^D}
\exp ( - || \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 /2\sigma^2 )
$$
and
$$
p(\mathbf{t}_n|\mathbf{x}_n,w) = \mathcal{N}(\mathbf{t}_n|\mathbf{y}(\mathbf{x}_n),\sigma^2 \mathbf{I}) \ ,
$$
we can calculate the loss as:
$$
\mathrm{loss} = -\log p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w)\\
= -\log \prod_{n=1}^N p(\mathbf{t}_n|\mathbf{x}_n,w)\\
= -\log \Big(\frac{1}{\sqrt{2\pi \sigma^2}^{DN}}
\exp ( - \sum_{n=1}^N|| \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 /(2\sigma^2) )\Big)\\
= \frac{DN}{2} \log 2\pi \sigma^2 + \sum_{n=1}^N|| \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 /(2\sigma^2)
$$
With the definition from the question,
$$
E(w) = \sum_{n=1}^N || \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 \ ,
$$
we then have:
$$
\mathrm{loss} = \frac{ND}{2} \log 2\pi \sigma^2 + \frac{1}{2\sigma^2}E(w) \ .
$$
Let's consider the more general question. Denoting the likelihood function $p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w)$ by $p$, the loss is defined as $\mathrm{loss} = -\log p$. The logarithm is strictly increasing, so applying it does not move the location of the optimum; multiplying by minus one turns the maximization of the likelihood into the minimization of a loss. The larger $p$ is, the smaller the loss, which accords with intuition, and as the likelihood gets closer to zero the loss grows without bound, so the loss effectively discriminates between low and high values of $p$.
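As a numerical sanity check of the algebra (with synthetic values standing in for real model outputs and targets), the directly computed negative log-likelihood agrees with the closed form $\frac{ND}{2}\log 2\pi\sigma^2 + \frac{1}{2\sigma^2}E(w)$:
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma2 = 10, 3, 0.5
t = rng.normal(size=(N, D))               # targets t_n
y = rng.normal(size=(N, D))               # stand-ins for model outputs y(x_n)

# Direct negative log-likelihood: -sum_n log N(t_n | y_n, sigma^2 I).
nll = -np.sum(-0.5 * D * np.log(2 * np.pi * sigma2)
              - np.sum((y - t) ** 2, axis=1) / (2 * sigma2))

# Closed form: ND/2 log(2 pi sigma^2) + E(w) / (2 sigma^2).
E = np.sum((y - t) ** 2)
closed = N * D / 2 * np.log(2 * np.pi * sigma2) + E / (2 * sigma2)
assert np.isclose(nll, closed)
```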
%% Cell type:markdown id: tags:
## Exercise f) optional
f) Show that the optimum (= minimum of the loss) with respect to $w$ is not affected by the value of $\sigma^2$. Find the optimum for $\sigma^2$ as a function of $w$.
This means that for the problem of finding $w$, sum of squares and maximum likelihood for the Gaussian likelihood with $\sigma^2 \mathbf{I}$ covariance are equivalent.
%% Cell type:markdown id: tags:
## Answer f)
Let's assume the loss is differentiable. To find the minimum of the loss we set the partial derivative to zero:
$$
\frac{\partial \mathrm{loss}}{\partial w} = \frac{1}{2\sigma^2}\cdot \frac{\partial E(w)}{\partial w} \ .
$$
Since $\frac{1}{2\sigma^2} > 0$, any $w$ with $\frac{\partial E(w)}{\partial w} = 0$ also makes $\frac{\partial \mathrm{loss}}{\partial w}=0$, and vice versa. That is, for the problem of finding $w$, sum of squares and maximum likelihood for the Gaussian likelihood with $\sigma^2 \mathbf{I}$ covariance are equivalent, and the optimum does not depend on $\sigma^2$.
For the optimum of $\sigma^2$ itself, setting $\frac{\partial \mathrm{loss}}{\partial \sigma^2} = \frac{ND}{2\sigma^2} - \frac{E(w)}{2\sigma^4} = 0$ gives $\sigma^2 = \frac{E(w)}{ND}$, i.e. the mean squared residual.
%% Cell type:markdown id: tags:
## Classification, one-hot and softmax
We will now turn to classification. In classification the $K$-dimensional target vector $\mathbf{t}$ encodes exactly one out of $K$ possible classes. It is convenient to use the so-called one-hot encoding for this, meaning that the $\mathbf{t}$ vector will have $K-1$ zeros and a single one at position $k$ if the target for the data point is class $k$. For example, if we have $K=4$ and the correct label is class three then $\mathbf{t}=(0,0,1,0)$.
We need to modify the network output $\mathbf{y}(\mathbf{x})$ in order to get a likelihood function. In the example above the likelihood should be the probability that the model assigns to class three. This means that the outputs should be probabilities. We can achieve this with the softmax function:
$$
y_k = \frac{\exp ( a_k )}{\sum_j \exp ( a_j )} \ ,
$$
where $a_j$ is shorthand for the linear output from the last layer: $a^{(L)}_j$. The $a$ vector is what in statistics is called the multinomial link function and its components are also called the logits. Combining the one-hot encoding of the target with the probabilistic model outputs we can write:
$$
p(\mathbf{t}_n|\mathbf{x}_n,w) = \prod_{k=1}^K \left[ y_k(\mathbf{x}_n)\right]^{t_{nk}} \ .
$$
We can now see that the one-hot encoding selects exactly the right output term and the other terms with $t_{nj}=0$ will contribute ones because $[y_{nj}]^0=1$.
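A small sketch of the softmax and the one-hot likelihood (the max-shift inside the softmax is a standard numerical-stability trick, not something the math requires):
%% Cell type:code id: tags:
``` python
import numpy as np

def softmax(a):
    a = a - np.max(a)                   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5, 0.1])     # logits a^(L) for K = 4 classes
y = softmax(a)
t = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot target: class three

# p(t|x,w) = prod_k y_k^{t_k} picks out exactly the probability of the true class.
likelihood = np.prod(y ** t)
assert np.isclose(likelihood, y[2])
```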
%% Cell type:markdown id: tags:
## Exercise g)
g) Show using the same procedure we used for regression that the loss function for classification is:
$$
E(w) = - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \log y_{k}(\mathbf{x}_n) \ .
$$
This is also known as the cross entropy loss.
%% Cell type:markdown id: tags:
## Answer g)
Since
$$
p(\mathbf{t}_n|\mathbf{x}_n,w) = \prod_{k=1}^K \left[ y_k(\mathbf{x}_n)\right]^{t_{nk}}
$$
and
$$
p(\mathbf{t}_1,\ldots,\mathbf{t}_N|\mathbf{x}_1,\ldots,\mathbf{x}_N,w) =
\prod_{n=1}^N p(\mathbf{t}_n|\mathbf{x}_n,w) \ ,
$$
we can derive the loss function as follows:
$$
E(w) = -\log(\prod_{n=1}^N \prod_{k=1}^K \left[ y_k(\mathbf{x}_n)\right]^{t_{nk}})\\
= -\sum_{n=1}^N \log(\prod_{k=1}^K \left[ y_k(\mathbf{x}_n)\right]^{t_{nk}})\\
= -\sum_{n=1}^N \sum_{k=1}^K \log(\left[ y_k(\mathbf{x}_n)\right]^{t_{nk}})\\
= - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \log y_{k}(\mathbf{x}_n)
$$
%% Cell type:markdown id: tags:
# Stochastic gradient descent
To paraphrase Stan Lee: With great flexibility comes complicated fitting to data! We therefore have to resort to general purpose optimization. Optimization refers to the process of searching for a set of weights $w$ that achieves some good result, such as a low loss on the training set. The loss function is differentiable so we can use gradient information to guide our search. In deep learning some variant of stochastic gradient descent is almost always used for optimization when the gradients can be computed. Stochastic here means that the gradient is computed on a subset of the training set.
Basic gradient descent means that we take a step with step-size $\eta$ opposite the parameter gradient direction:
$$
w^{\mathrm{new}} = w - \eta \nabla E(w) \ ,
$$
where $\nabla$ is the parameter gradient operator. Applying $\nabla$ to a scalar function like $E(w)$ produces a vector as output where the $j$th component of the vector is the derivative of $E(w)$ with respect to the $j$th parameter. A conceptual sketch of the optimization problem is given in the figure below.
%% Cell type:code id: tags:
``` python
# Bishop figure 5.5
Image("figures/Figure5.5.jpg",width=500)
```
%% Output
<IPython.core.display.Image object>
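%% Cell type:markdown id: tags:
The update rule itself is only a few lines of code. Here is a minimal sketch on a toy quadratic loss (the loss function, step size and iteration count are arbitrary illustrations, not a recipe):
%% Cell type:code id: tags:
``` python
import numpy as np

def E(w):       # toy loss with minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad_E(w):  # its gradient
    return np.array([2 * (w[0] - 1.0), 2 * (w[1] + 2.0)])

w = np.zeros(2)
eta = 0.1                                # step size
for _ in range(100):
    w = w - eta * grad_E(w)              # w_new = w - eta * grad E(w)
print(w)                                 # close to (1, -2)
```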
%% Cell type:markdown id: tags:
## Exercise h) optional
Prove that the gradient calculated on a random subset of the training set on average is proportional to the true gradient.
%% Cell type:markdown id: tags:
## Answer h)
Write the loss as a sum of per-example terms, for instance for the sum of squares
$$
E(w) = \sum_{n=1}^N E_n(w) \ , \qquad E_n(w) = || \mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n||_2^2 \ ,
$$
so that
$$
\nabla E(w) = \sum_{n=1}^N \nabla E_n(w) \ .
$$
Let $S$ be a subset of $M$ indices drawn uniformly at random from $\{1,\ldots,N\}$ and define the subset (minibatch) gradient as
$$
\nabla E_S(w) = \sum_{n\in S} \nabla E_n(w) \ .
$$
Each index $n$ is included in $S$ with probability $M/N$, so on average
$$
\mathbb{E}[\nabla E_S(w)] = \sum_{n=1}^N P(n\in S)\, \nabla E_n(w) = \frac{M}{N} \sum_{n=1}^N \nabla E_n(w) = \frac{M}{N}\, \nabla E(w) \ .
$$
That is, the gradient calculated on a random subset of the training set is on average proportional to the true gradient, with proportionality constant $M/N$.
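The statement is easy to verify numerically: averaging the subset gradient over many random subsets approaches $\frac{M}{N}$ times the full gradient. A sketch with random vectors standing in for the per-sample gradients $\nabla E_n(w)$:
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(3)
N, M = 100, 10
per_sample_grads = rng.normal(size=(N, 5))     # stand-ins for grad E_n(w)
full_grad = per_sample_grads.sum(axis=0)       # grad E(w) = sum_n grad E_n(w)

# Average the subset gradient over many random subsets of size M.
avg = np.zeros(5)
trials = 20000
for _ in range(trials):
    idx = rng.choice(N, size=M, replace=False)
    avg += per_sample_grads[idx].sum(axis=0)
avg /= trials

print(np.allclose(avg, M / N * full_grad, atol=0.1))  # True up to Monte Carlo noise
```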
%% Cell type:markdown id: tags:
# Error backpropagation
So-called error backpropagation is simply a recipe for calculating gradients of layered models such as neural networks. Efficient computation is based upon 1) the [chain-rule of differentiation](https://en.wikipedia.org/wiki/Chain_rule):
$$
\frac{\partial f(g(w))}{\partial w} = \left. \frac{\partial f(g)}{\partial g} \right|_{g=g(w)} \frac{\partial g(w)}{\partial w}
$$
and 2) storing intermediate computations. In the following we will omit writing $g=g(w)$ and instead let it be understood from the context.
## Gradient for layer $L$
We return to the FFNN with $L$ layers. We start by computing the gradient with respect to a weight in the last layer:
$$
\frac{\partial E(w)}{\partial w^{(L)}_{ji}} \ .
$$
To make life a bit simpler for ourselves we will assume that our training set consists of one example so we can drop the training point index $n$. (As an optional exercise below you can put the summation over the training examples back in.)
First we observe that the $w$ dependence in $E(w)$ is through $a^{(L)}_1,\ldots,a^{(L)}_K$ so we can now apply the chain-rule
$$
\frac{\partial E(w)}{\partial w^{(L)}_{ji}} = \sum_{k=1}^K \frac{\partial E(w)}{\partial a^{(L)}_{k}} \frac{\partial a^{(L)}_{k}}{\partial w^{(L)}_{ji}} \ .
$$
We can use that $a^{(L)}_{k} = \sum_i w^{(L)}_{ki} z^{(L-1)}_i$ to conclude that $\frac{\partial a^{(L)}_{k}}{\partial w^{(L)}_{ji}} = z^{(L-1)}_i$ when $j=k$ and zero otherwise. So we can write
$$
\frac{\partial E(w)}{\partial w^{(L)}_{ji}} = \delta^{(L)}_j z^{(L-1)}_i \ ,
$$
where we with foresight have defined
$$
\delta^{(L)}_j = \frac{\partial E(w)}{\partial a^{(L)}_{j}} \ .
$$
These $\delta$s will become very practical for bookkeeping purposes.
%% Cell type:markdown id: tags:
## Exercise i)
Calculate
$$
\delta^{(L)}_j = \frac{\partial E(w)}{\partial a^{(L)}_{j}}
$$
for classification.
Hint: It is much easier to find the derivative if we write the loss function directly in terms of the logits $a^{(L)}_{j}$. So show first using the definition of the loss and the softmax that
$$
E(w) = - \sum_{k=1}^K t_k a^{(L)}_{k} + \log \sum_{k=1}^K \exp( a^{(L)}_{k} ) \ .
$$
Finally show that
$$
\frac{\partial}{\partial a^{(L)}_{j}} \log \sum_{k=1}^K \exp( a^{(L)}_{k} ) = y_j
$$
to get the final result.
%% Cell type:markdown id: tags:
## Answer i)
Considering only one sample, the classification loss is
$$
E(w) = -\sum_{k=1}^K t_k \log y_k \ ,
$$
so
$$
\delta_j^{(L)} = \frac{\partial E(w)}{\partial a_j^{(L)}}
= \sum_{k=1}^K \frac{\partial E(w)}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j^{(L)}}
= -\sum_{k=1}^K \frac{t_k}{y_k} \cdot \frac{\partial y_k}{\partial a_j^{(L)}} \ .
$$
Let's work out $\frac{\partial y_k}{\partial a_j^{(L)}}$. From
$$
y_k = \frac{\exp ( a_k^{(L)} )}{\sum_i \exp ( a_i^{(L)} )}
$$
the quotient rule gives
$$
\frac{\partial y_k}{\partial a_j^{(L)}} = \frac{\frac{\partial \exp(a_k^{(L)})}{\partial a_j^{(L)}}\cdot \sum_i \exp(a_i^{(L)}) - \exp(a_k^{(L)})\cdot \exp(a_j^{(L)})}{\left(\sum_i \exp(a_i^{(L)})\right)^2} \ .
$$
There are two different cases.
When $k=j$:
$$
\frac{\partial y_k}{\partial a_j^{(L)}} = \frac{\exp(a_k^{(L)})\cdot \sum_i \exp(a_i^{(L)}) - \exp(a_k^{(L)})^2}{\left(\sum_i \exp(a_i^{(L)})\right)^2}\\
= y_k -y_k^2\\
= y_k(1-y_k)
$$
When $k\neq j$:
$$
\frac{\partial y_k}{\partial a_j^{(L)}} = \frac{- \exp(a_k^{(L)})\cdot \exp(a_j^{(L)})}{\left(\sum_i \exp(a_i^{(L)})\right)^2}\\
= -y_k\cdot y_j
$$
Substituting $\frac{\partial y_k}{\partial a_j^{(L)}}$ back into $\delta_j^{(L)}$ gives
$$
\delta_j^{(L)} = \sum_{k=1}^K f_k \ , \qquad
f_k = \begin{cases} t_k(y_j-1) & k=j \\ t_k\, y_j & k\neq j \end{cases}
$$
Summing up and using that $\mathbf{t}$ is one-hot so $\sum_k t_k = 1$, we get
$$
\delta_j^{(L)} = y_j \sum_{k=1}^K t_k - t_j = y_j - t_j \ ,
$$
i.e. $\delta_j^{(L)} = y_j - 1$ when $j$ is the true class and $\delta_j^{(L)} = y_j$ otherwise: the model output minus the target output of the $j$-th output unit.
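We can check the result $\delta^{(L)}_j = y_j - t_j$ numerically with central finite differences of the cross entropy written as a function of the logits (a sanity check added for illustration, not part of the derivation):
%% Cell type:code id: tags:
``` python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def E(a, t):                             # cross entropy as a function of the logits
    return -np.sum(t * np.log(softmax(a)))

a = np.array([0.3, -1.2, 2.0, 0.7])
t = np.array([0.0, 0.0, 1.0, 0.0])

delta = softmax(a) - t                   # claimed delta^(L) = y - t
eps = 1e-6
fd = np.array([(E(a + eps * np.eye(4)[j], t) - E(a - eps * np.eye(4)[j], t)) / (2 * eps)
               for j in range(4)])       # central differences dE/da_j
assert np.allclose(delta, fd, atol=1e-6)
```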
%% Cell type:markdown id: tags:
## Gradient for layer $L-1$
Now we proceed to the second last layer
$$
\frac{\partial E(w)}{\partial w^{(L-1)}_{ji}} \ .
$$
This will allow us to identify a pattern that will enable us to generalize to any layer.
We now use that the model is layered so the chain-rule we need is:
$$
\frac{\partial E(w)}{\partial w^{(L-1)}_{ji}} = \sum_{k=1}^K \frac{\partial E(w)}{\partial a^{(L)}_{k}} \frac{\partial a^{(L)}_{k}}{\partial a^{(L-1)}_{j}} \frac{\partial a^{(L-1)}_{j}}{\partial w^{(L-1)}_{ji}} \ .
$$
We can now again use $a^{(L)}_{k} = \sum_i w^{(L)}_{ki} z^{(L-1)}_i= \sum_i w^{(L)}_{ki} h_{L-1}(a^{(L-1)}_i)$ and the definition of $\delta^{(L)}_j$:
$$
\frac{\partial E(w)}{\partial w^{(L-1)}_{ji}} = \sum_{k=1}^K \delta^{(L)}_k w^{(L)}_{kj} h_{L-1}'(a^{(L-1)}_j) z^{(L-2)}_i \ .
$$
We can now define
$$
\delta^{(L-1)}_j = \sum_{k=1}^K \delta^{(L)}_k w^{(L)}_{kj} h_{L-1}'(a^{(L-1)}_j)
$$
to write the gradient in the same way as for layer $L$:
$$
\frac{\partial E(w)}{\partial w^{(L-1)}_{ji}} = \delta^{(L-1)}_j \ z^{(L-2)}_i \ .
$$
If we look a bit closer at the definition of $\delta^{(L-1)}_j$ we see it is actually nothing but:
$$
\delta^{(L-1)}_j = \frac{\partial E(w)}{\partial a^{(L-1)}_{j}} \ .
$$
You will use this in the next exercise to derive the backpropagation rule for a general layer $l<L$.
%% Cell type:markdown id: tags:
## Exercise j) - the backpropagation rule for layer $l<L$
j) Use the above results to argue why the general backpropagation rule for $l<L$ is written as:
$$
\begin{align}
\frac{\partial E(w)}{\partial w^{(l)}_{ji}} & = \delta^{(l)}_j z^{(l-1)}_i \\
\delta^{(l)}_j & = \sum_{k=1}^K \delta^{(l+1)}_k w^{(l+1)}_{kj} h_{l}'(a^{(l)}_j)
\end{align}
$$
%% Cell type:markdown id: tags:
## Answer j)
### Proof 1: $\frac{\partial E(w)}{\partial w^{(l)}_{ji}} = \delta^{(l)}_j z^{(l-1)}_i$
$$
\frac{\partial E(w)}{\partial w^{(l)}_{ji}} = \sum_{k=1}^K \frac{\partial E(w)}{\partial a_k^{(l)}}\cdot \frac{\partial a_k^{(l)}}{\partial w^{(l)}_{ji}}
$$
Since
$$
a_k^{(l)} = \sum_i w_{ki}^{(l)} \cdot z_i^{(l-1)} \ ,
$$
we have $\frac{\partial a_k^{(l)}}{\partial w_{ji}^{(l)}}\neq 0$ if and only if $k=j$, in which case
$$
\frac{\partial a_k^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)} \ .
$$
Then
$$
\frac{\partial E(w)}{\partial w^{(l)}_{ji}} = \frac{\partial E(w)}{\partial a_j^{(l)}}\cdot z_i^{(l-1)}
= \delta^{(l)}_j z^{(l-1)}_i \ .
$$
### Proof 2: $\delta^{(l)}_j = \sum_{k=1}^K \delta^{(l+1)}_k w^{(l+1)}_{kj} h_{l}'(a^{(l)}_j)$
$$
\delta^{(l)}_j = \frac{\partial E(w)}{\partial a_j^{(l)}}\\
= \sum_{k=1}^K \frac{\partial E(w)}{\partial a_k^{(l+1)}}\cdot \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}\\
= \sum_{k=1}^K \frac{\partial E(w)}{\partial a_k^{(l+1)}}\cdot \frac{\partial (\sum_i w_{ki}^{(l+1)}\cdot z_i^{(l)})}{\partial a_j^{(l)}}\\
= \sum_{k=1}^K \frac{\partial E(w)}{\partial a_k^{(l+1)}}\cdot \frac{\partial (\sum_i w_{ki}^{(l+1)}\cdot h_l(a_i^{(l)}))}{\partial a_j^{(l)}}\\
= \sum_{k=1}^K \frac{\partial E(w)}{\partial a_k^{(l+1)}}\cdot w_{kj}^{(l+1)} h_l'(a_j^{(l)})\\
= \sum_{k=1}^K \delta^{(l+1)}_k w^{(l+1)}_{kj} h_{l}'(a^{(l)}_j)
$$
where only the $i=j$ term of the inner sum depends on $a_j^{(l)}$.
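Putting the two rules together, the whole backward pass fits in a few lines. Below is a minimal NumPy sketch for the $L$-layer network from exercise c), assuming $\tanh$ hidden layers, a linear output and $\delta^{(L)} = y - t$ (the result from exercises i) and k)); the returned gradients can be verified against finite differences just like in the check after answer i):
%% Cell type:code id: tags:
``` python
import numpy as np

def forward(weights, x):
    """Forward pass storing a's and z's: tanh hidden layers, linear output."""
    zs, activations = [x], []
    for l, W in enumerate(weights):
        a = W @ zs[-1]
        activations.append(a)
        zs.append(np.tanh(a) if l < len(weights) - 1 else a)
    return activations, zs

def backward(weights, activations, zs, t):
    """Backward pass: delta^(L) = y - t, then the recursion from exercise j)."""
    grads = [None] * len(weights)
    delta = zs[-1] - t                        # delta^(L)_j = y_j - t_j
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(delta, zs[l])     # dE/dw^(l)_{ji} = delta^(l)_j z^(l-1)_i
        if l > 0:                             # delta^(l-1)_j = sum_k delta^(l)_k w^(l)_{kj} h'(a^(l-1)_j)
            delta = (weights[l].T @ delta) * (1.0 - np.tanh(activations[l - 1]) ** 2)
    return grads

# Tiny usage example with random weights and a random input/target pair.
rng = np.random.default_rng(4)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x, t = rng.normal(size=3), rng.normal(size=2)
activations, zs = forward(weights, x)
grads = backward(weights, activations, zs, t)
print([g.shape for g in grads])               # matches the weight shapes
```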
%% Cell type:markdown id: tags:
## Exercise k) - optional
k) Derive the backpropagation rule for regression. That is, perform the calculation in exercise i) for the regression loss. Everything else stays the same. Actually it turns out that if you use $h_L(a) = a$ then you even get the same result as in i).
%% Cell type:markdown id: tags:
## Answer k)
For regression, take the sum of squares as an example:
$$
E(w) = \frac{1}{2}|| \mathbf{y}(\mathbf{x}) - \mathbf{t}||_2^2
= \frac{1}{2} \sum_{k=1}^K (y_k-t_k)^2 \ .
$$
The factor $\frac{1}{2}$ is only there to scale the loss function for convenience; it does not affect the result.
$$
\delta_j^{(L)} = \frac{\partial E(w)}{\partial a_j^{(L)}}
= (y_j-t_j) \cdot h_L'(a_j^{(L)})
$$
It is apparent that when $h_L(a)=a$ (so $h_L'(a)=1$), we get the same result as in i), that is, $\delta_j^{(L)} = y_j-t_j$, where $t_j$ is the target output of the $j$-th output unit.
%% Cell type:markdown id: tags:
## Exercise l)
l) Modify the backpropagation rules to be able to take a dataset of size greater than one.
Hint: This only requires introducing a sum over $n$ and $n$ indices in a few places.
%% Cell type:markdown id: tags:
## Answer l)
When there are $N$ samples in the dataset,
$$
E(w) = -\sum_{n=1}^N \sum_{k=1}^K t_{nk}\log y_k(\mathbf{x}_n)
= -\sum_{n=1}^N \log y_{\mathbf{k}_n}(\mathbf{x}_n) \ ,
$$
where $\mathbf{k}_n$ denotes the ground-truth class of $\mathbf{x}_n$ (not to be confused with the index counter $k$), and
$$
y_{k}(\mathbf{x}_n) = \frac{\exp(a_{nk}^{(L)})}{\sum_i \exp(a_{ni}^{(L)})} \ .
$$
The samples enter $E(w)$ additively and sample $n$ only influences the activations $a_{nj}^{(l)}$ computed from $\mathbf{x}_n$, so differentiation is independent for each sample. Repeating the calculation from i) per sample therefore gives a per-sample delta
$$
\delta_{nj}^{(L)} = \frac{\partial E(w)}{\partial a_{nj}^{(L)}} = y_j(\mathbf{x}_n) - t_{nj} \ ,
$$
i.e. $y_j(\mathbf{x}_n)-1$ when $j=\mathbf{k}_n$ and $y_j(\mathbf{x}_n)$ otherwise.
From this we can conclude that a dataset of size greater than one does not result in a more complicated $\delta^{(L)}$; it only introduces a sum over $n$ and $n$ indices in a few places. The backpropagation rules become
$$
\begin{align}
\frac{\partial E(w)}{\partial w^{(l)}_{ji}} & = \sum_{n=1}^N \delta^{(l)}_{nj} z^{(l-1)}_{ni} \\
\delta^{(l)}_{nj} & = \sum_{k=1}^K \delta^{(l+1)}_{nk} w^{(l+1)}_{kj} h_{l}'(a^{(l)}_{nj})
\end{align}
$$
where $\delta^{(l)}_{nj}$, $z^{(l-1)}_{ni}$ and $a^{(l)}_{nj}$ are the per-sample quantities from questions i) and j). If we instead choose the average loss as the metric, the gradient is simply divided by $N$.
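A sketch of the batched rules in code (reusing the hypothetical `forward` and `backward` helpers from the code cell after answer j); only an illustration of how the per-sample gradients accumulate over $n$):
%% Cell type:code id: tags:
``` python
import numpy as np

def batch_gradients(weights, X, T):
    """Sum the per-sample gradients over a dataset of (x_n, t_n) pairs."""
    grads = [np.zeros_like(W) for W in weights]
    for x, t in zip(X, T):                             # sum over n = 1, ..., N
        activations, zs = forward(weights, x)          # from the earlier sketch
        for g, g_n in zip(grads, backward(weights, activations, zs, t)):
            g += g_n                                   # dE/dw^(l) += delta^(l)_n z^(l-1)_n
    return grads                                       # divide by len(X) for the average
```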
%% Cell type:markdown id: tags:
# Final comments on backpropagation and what is next
So optimizing a feed-forward neural network with gradient descent is pretty straightforward!
1. You forward propagate with the rules you derived in exercise c), storing the results for the $a$s and $z$s.
2. Then you run the backward propagation recursion for the $\delta$s taking $\delta^{(L)}_j$ from exercise i) and using the recursion from exercise j).
3. You get the gradients from the expression in exercise j).
4. Finally you update the weights of the network according to the gradients, and the network should now have better performance on the training set.
If you wonder why the forward recursion for $a$ and $z$ is non-linear and the backward recursion for $\delta$ is linear then remember that the derivative is a [linear operator](https://en.wikipedia.org/wiki/Operator_(mathematics)).
So what comes next? The next two weeks we will stay with the FFNN model. You will implement the forward and backward pass in NumPy yourself next week and in two weeks we will see how we can do this in PyTorch. One of the features of modern frameworks like PyTorch and TensorFlow is that we only need to bother about the forward pass. We have finally "taught" computers to differentiate. ;-)
In three weeks+ we will look at other generalisations of the FFNN. They all build upon the same principles as for the FFNN so the same insights apply.