The central idea of the paper itself is simple and elegant. They take a
standard feed-forward ConvNet and add skip connections that bypass (or
shortcut) a few convolution layers at a time. Each bypass gives rise to a
residual block in which the convolution layers predict a residual that is added to the block's input tensor.
Deep feed-forward conv nets, however, tend to suffer from optimization difficulty: past a certain depth, both training and validation error go up. The residual network architecture addresses this by adding shortcut
connections that are summed with the output of the convolution layers.
Add the block input 'x' (the previous conv output, carried by the shortcut) to the output F(x) of the conv layers, which is the residual:
H(x) = F(x) + x
     = F(x) + Ix    // multiplication with the identity I, hence called identity mapping
If x is already sufficient (the identity is close to optimal), F(.) will learn to push its filter weights towards zero. Otherwise it learns to adjust the weights to get the optimal value.
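The relation H(x) = F(x) + x can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the names `linear`, `relu`, and `residual_block` are made up here, and a matrix-vector product stands in for a real convolution layer.

```python
# Minimal sketch of a residual block on plain Python lists (no framework).
# `linear`, `relu`, and `residual_block` are illustrative names, not from the paper.

def linear(weights, x):
    """Matrix-vector product; a stand-in for a convolution layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, vi) for vi in v]

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, where F is two stacked layers with a ReLU in between."""
    f = linear(w2, relu(linear(w1, x)))        # residual branch F(x)
    return [fi + xi for fi, xi in zip(f, x)]   # identity shortcut adds x back

x = [1.0, -2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all weights at zero, F(x) = 0 and the block reduces to the identity.
h_zero = residual_block(x, zero, zero)

# With identity weights, F(x) = relu(x), so H(x) = relu(x) + x.
h_other = residual_block(x, identity, identity)
```

With zero weights the block simply passes x through, which is the "F(.) learns to weight the filters to zero" case; any non-zero F adds a correction on top of x.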
Simply stacking a series of conv layers leads to large training error: a 56-layer plain net has higher training error and test error than a 20-layer net. In other words, "overly deep" plain nets have higher training error.
Very simple design (series of fixed 3x3 conv layers)
When the shortcut mapping is the identity, the forward pass propagates the signal additively and the loss gradient is passed back additively as well (as opposed to multiplicative gradient propagation otherwise).
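The additive propagation can be written out as a short derivation. The symbols are assumptions made for this sketch: x_l is the input of block l, F is the residual function from H(x) = F(x) + x, L indexes a deeper block, and \mathcal{E} is the loss.

```latex
% Forward: unrolling x_{l+1} = x_l + F(x_l) from block l to block L
x_L = x_l + \sum_{i=l}^{L-1} F(x_i)

% Backward: the chain rule gives an additive term "1" for the shortcut path
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)

% Plain net without shortcuts, for comparison: a product of per-layer Jacobians
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \prod_{i=l}^{L-1} \frac{\partial x_{i+1}}{\partial x_i}
```

The "1" in the second equation means part of the gradient at x_L reaches x_l unchanged, regardless of what the conv layers in between do.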
What if the shortcut mapping ℎ ≠ identity?
E.g., conv(), xor, multiplying by 0.5, etc.: each of these increases the error.
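One way to get a feel for the "multiply with 0.5" case is to track only the shortcut path. The sketch below assumes a shortcut of the form h(x) = lam * x in every block and ignores the contribution of the conv branch; `shortcut_gradient_factor` is a hypothetical helper written for this note, not something from the paper.

```python
# Sketch: how much gradient survives along the shortcut path alone,
# assuming each block's shortcut is h(x) = lam * x (lam = 1.0 is the identity).
# This ignores the path through the conv layers F on purpose.

def shortcut_gradient_factor(lam, n_blocks):
    """Product of the per-block shortcut derivatives, i.e. lam ** n_blocks."""
    factor = 1.0
    for _ in range(n_blocks):
        factor *= lam   # derivative of lam * x with respect to x is lam
    return factor

identity_factor = shortcut_gradient_factor(1.0, 50)   # stays at 1.0
scaled_factor = shortcut_gradient_factor(0.5, 50)     # shrinks geometrically

print(identity_factor, scaled_factor)
```

With the identity the factor stays at 1 for any depth, while a constant 0.5 shrinks it geometrically with the number of blocks, so the direct path no longer carries the signal.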
Keep the shortest path as smooth as possible, so that
forward/backward signals flow directly through this path.
MRF is a generative model. Hence we need to model
i) the likelihood of image given label
ii) prior of label
Inference can then be modeled from the joint probability (using Bayes' theorem) as the conditional probability of the label given the image.
To make inference tractable, only local relationships between labels are encoded in the prior ii).
CRF can directly model the conditional probability of the label given the image, hence we don't need to explicitly model i) and ii).
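The difference can be summarized in formulas. The notation is assumed for this sketch: x is the image, y the labels, \mathcal{N} the set of neighbouring label pairs, V a pairwise potential, E an energy function, and Z a normalizing constant.

```latex
% MRF (generative): posterior via Bayes' theorem from likelihood i) and prior ii)
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} \;\propto\; P(x \mid y)\, P(y)

% Prior ii) restricted to local (neighbouring) label interactions
P(y) = \frac{1}{Z} \exp\Big( -\sum_{(i,j) \in \mathcal{N}} V(y_i, y_j) \Big)

% CRF (discriminative): the conditional is modeled directly,
% with a normalizer that depends on the image
P(y \mid x) = \frac{1}{Z(x)} \exp\big( -E(y, x) \big)
```

In the CRF form the potentials inside E may depend on the image x, which is why the likelihood and the prior do not have to be specified separately.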