Reading:
According to Marr (1982), understanding an information processing system involves three levels, called the levels of analysis: computational theory, representation and algorithm, and hardware implementation.
"The brain is one hardware implementation for learning or pattern recognition."
There are mainly two paradigms for parallel processing: SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data).
"The problem now is to distribute a task over a network of such processors and to determine the local parameter values"
"Learning: We do not need to program such machines and determine the parameter values ourselves if such machines can learn from examples."
"Thus, artificial neural networks are a way to make use of the parallel hardware we can build with current technology and—thanks to learning— they need not be programmed. Therefore, we also save ourselves the effort of programming them."
"Keep in mind that the operation of an artificial neural network is a mathematical function that can be implemented on a serial computer"
ANNs are the method used in Deep Learning.
They are versatile, powerful, and scalable, making them ideal for large and highly complex Machine Learning tasks such as image classification, speech recognition, and recommender systems.
The multi-arrow inputs indicate a doubled weight, i.e. a single input strong enough to activate the post-synaptic neuron on its own.
Operation
The most common step function used in Perceptrons is the Heaviside step function:
$$ \text{heaviside}(z) = \begin{cases} 0, &\text{if $z<0$} \\ 1, &\text{if $z \ge 0$} \end{cases} $$
Sometimes the sign function is used instead.
$$ \text{sgn}(z) = \begin{cases} -1, &\text{if $z<0$} \\ 0, &\text{if $z=0$} \\ 1, &\text{if $z > 0$} \end{cases} $$
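A minimal numpy sketch of these two activations (vectorized so they apply elementwise; note that, matching the definition above, this heaviside returns 1 at $z = 0$):

```python
import numpy as np

def heaviside(z):
    # 0 for z < 0, 1 for z >= 0 (matches the definition above)
    return np.where(z < 0, 0, 1)

def sgn(z):
    # -1 for z < 0, 0 for z == 0, +1 for z > 0
    return np.sign(z)

z = np.array([-2.0, 0.0, 3.5])
print(heaviside(z))  # [0 1 1]
print(sgn(z))        # [-1.  0.  1.]
```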
There is currently a new wave of interest in ANNs, and there are good reasons to believe that this wave is different and will have a much more profound impact on our lives.
The basic processing element is the perceptron.
All of these represent mathematical functions and their compositions.
Break the computation into a tree of elementary operations ("ops") with variables flowing from the inputs to the outputs, as in the sketch below.
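A hedged sketch of this idea (the `Var`/`Add`/`Mul` classes are illustrative, not from the reading): each node is an elementary op, and evaluation flows from the input leaves to the output root.

```python
# Minimal computation-graph sketch: each node is an elementary op.
class Var:
    def __init__(self, value): self.value = value
    def eval(self): return self.value

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self): return self.a.eval() + self.b.eval()

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self): return self.a.eval() * self.b.eval()

# y = w*x + b expressed as a tree of ops
x, w, b = Var(2.0), Var(0.5), Var(1.0)
y = Add(Mul(w, x), b)
print(y.eval())  # 2.0
```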
Write the mathematical function implemented by a perceptron
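Using the Heaviside step defined above, with input $\mathbf{x}$, weight vector $\mathbf{w}$, and bias $w_0$, a perceptron computes:

$$ y = \text{heaviside}(\mathbf{w}^\top \mathbf{x} + w_0) = \text{heaviside}\left(\sum_{j=1}^{d} w_j x_j + w_0\right) $$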
Picture the decision surface for the following 2-input cases...
A perceptron uses a linear decision boundary (a.k.a. decision surface)
An XOR requires a nonlinear decision boundary
"We can represent any Boolean function as a disjunction of conjunctions, and such a Boolean expression can be implemented by a multilayer perceptron with one hidden layer."
$x_1$ XOR $x_2$ = ($x_1$ AND ∼$x_2$) OR (∼$x_1$ AND $x_2$)
I.e., form the function as a composition of elementary functions, then implement each elementary function with a perceptron.
Composition of perceptrons = MLP
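A minimal sketch of this XOR decomposition as a composition of perceptrons (the weights and biases are chosen by hand for illustration):

```python
import numpy as np

def heaviside(z):
    # 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    # A single perceptron: step(w . x + b)
    return heaviside(np.dot(w, x) + b)

def xor(x1, x2):
    # Hidden layer: h1 = x1 AND NOT x2, h2 = NOT x1 AND x2
    h1 = perceptron([x1, x2], w=[1, -1], b=-0.5)   # fires only for (1, 0)
    h2 = perceptron([x1, x2], w=[-1, 1], b=-0.5)   # fires only for (0, 1)
    # Output layer: OR of the hidden units
    return perceptron([h1, h2], w=[1, 1], b=-0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))  # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```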
"for every input combination where the output is 1, we define a hidden unit that checks for that particular conjunction of the input. The output layer then implements the disjunction."
In other words, we can implement any truth table by adding a hidden perceptron for each row whose output is 1 and OR-ing them in the output unit.
Generally an impractical approach (up to $2^d$ hidden units may be necessary when there are $d$ inputs), but it proves the universal ability of an MLP with a single hidden layer to represent any Boolean function (see the sketch after the truth table below).
A random truth table.
Suppose $x_1$, $x_2$, and $x_3$ are features of an email (e.g. contains a certain word or not)
$y$ is the prediction of whether the email is spam or not, based on the features
$x_1$ | $x_2$ | $x_3$ | $y$ |
---|---|---|---|
0 | 0 | 0 | 1 |
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
1 | 1 | 0 | 1 |
1 | 1 | 1 | 1 |
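Applying the quoted construction to this table: a hedged sketch (weights and biases picked by hand) with one hidden unit per row where $y = 1$, and an OR in the output layer:

```python
import itertools
import numpy as np

def step(z):
    # Heaviside step, elementwise
    return (np.asarray(z) >= 0).astype(int)

# Rows of the truth table where y = 1
positive_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]

# Hidden layer: unit k fires only when the input matches row k exactly
# (weight +1 where the row has a 1, -1 where it has a 0, bias 0.5 - #ones).
W = np.array([[1 if r_j == 1 else -1 for r_j in row] for row in positive_rows])
b = np.array([0.5 - sum(row) for row in positive_rows])

def predict(x):
    h = step(W @ np.asarray(x) + b)              # conjunctions, one per positive row
    return int(step(np.ones(len(h)) @ h - 0.5))  # disjunction of the hidden units

for row in itertools.product((0, 1), repeat=3):
    print(row, predict(row))  # reproduces the y column of the table
```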
Universal approximation: a constructive proof uses two hidden layers, and it has also been proven that an MLP with a single hidden layer (with an arbitrary number of hidden units) can approximate any nonlinear function of the input (Hornik, Stinchcombe, and White 1989).
I.e. even a Single Hidden Layer can achieve arbitrary decision boundaries
However, the future of Neural Networks turned out to be "deep" stacks of layers...