# Study notes for Sparse Coding

### Sparse Coding

• Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of $k$ basis vectors $\phi_i \in \mathbb{R}^n$, with $k > n$ (hence "over-complete"), such that an input vector $x\in \mathbb{R}^n$ can be represented as a linear combination of these basis vectors:
$x=\sum_{i=1}^k a_i \phi_i$
• The advantage of having an over-complete basis is that our basis vectors $\phi_i$ are better able to capture structures and patterns inherent in the input data $x$
• However, with an over-complete basis, the coefficients are no longer uniquely determined by the input vector. Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the degeneracy introduced by over-completeness.
• The sparse coding cost function is defined on a set of m input vectors as:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)})$
where $S(.)$ is a sparsity function which penalizes $a_i$ for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $x$, and the second term as a sparsity penalty which forces our representation of $x$ (i.e., the learned features) to be sparse. The constant $\lambda$ is a scaling constant to determine the relative importance of these two contributions.
• Although the most direct measure of sparsity is the $L_0$ norm, it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost $S(.)$ are the $L_1$ penalty $S(a_i)=|a_i|$ and the log penalty $S(a_i)=\log(1+a_i^2)$.
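As a concrete illustration, the objective above (with the $L_1$ penalty) can be computed directly. This is a minimal numpy sketch; the function and variable names are illustrative, with the basis vectors stored as the columns of `Phi` and the coefficients for the $m$ inputs as the columns of `A`:

```python
import numpy as np

def sparse_coding_objective(X, Phi, A, lam):
    """Sparse coding cost with an L1 sparsity penalty.

    X:   (n, m) matrix of m input vectors x^(j)
    Phi: (n, k) matrix of k basis vectors phi_i (columns)
    A:   (k, m) matrix of coefficients a_i^(j)
    lam: weight of the sparsity penalty
    """
    reconstruction = np.sum((X - Phi @ A) ** 2)  # reconstruction term
    sparsity = np.sum(np.abs(A))                 # L1 sparsity penalty
    return reconstruction + lam * sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))    # m = 10 inputs in R^4
Phi = rng.normal(size=(4, 8))   # k = 8 > n = 4: over-complete basis
A = rng.normal(size=(8, 10))
cost = sparse_coding_objective(X, Phi, A, lam=0.1)
```

Note that with all coefficients set to zero the sparsity penalty vanishes and the cost reduces to the total energy of the inputs, which makes the trade-off between the two terms easy to sanity-check.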
• It is also possible to make the sparsity penalty arbitrarily small by scaling down $a_i$ and scaling up $\phi_i$ by some large constant. To prevent this from happening, we constrain $\|\phi_i\|^2$ to be less than some constant $C$. The full sparse coding cost function is hence:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)})$
$\mbox{subject to } \|\phi_i\|^2\le C, \quad \forall i=1, \ldots, k$
where the constant is usually set to $C=1$.
• One problem is that the constraint cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of $\phi_i$ small:
$\mbox{minimize}_{a_i^{(j)}, \phi_i} \sum_{j=1}^m \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k S(a_i^{(j)})+\gamma \sum_{i=1}^k \|\phi_i\|_2^2$
• Another problem is that the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. We will "smooth out" the L1 norm using an approximation which allows us to use gradient descent. To "smooth out" the L1 norm, we use $\sqrt{x^2+\epsilon}$ in place of $|x|$, where $\epsilon$ is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when $\epsilon$ is large compared to $x$, $x^2+\epsilon$ is dominated by $\epsilon$, and taking the square root yields approximately $\sqrt{\epsilon}$).
• Hence, the final objective function is:
$J(\phi, A)=\sum_{j=1}^m \|x^{(j)}-\sum_{i=1}^k a_i^{(j)} \phi_i\|^2 + \lambda \sum_{i=1}^k \sqrt{(a_i^{(j)})^2+\epsilon}+\gamma \sum_{i=1}^k \|\phi_i\|_2^2$
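The smoothed objective above is straightforward to translate into code. The sketch below is illustrative (names and hyperparameter values are assumptions, not from a specific implementation); it shows both the smoothed-L1 surrogate and the full objective $J(\phi, A)$:

```python
import numpy as np

def smooth_l1(a, eps):
    """Differentiable surrogate for |a|: sqrt(a^2 + eps)."""
    return np.sqrt(a ** 2 + eps)

def J(X, Phi, A, lam, eps, gamma):
    """Final smoothed sparse coding objective.

    X:   (n, m) inputs, Phi: (n, k) basis, A: (k, m) coefficients
    """
    recon = np.sum((X - Phi @ A) ** 2)        # reconstruction term
    sparsity = np.sum(smooth_l1(A, eps))      # smoothed L1 penalty
    decay = gamma * np.sum(Phi ** 2)          # weight decay replacing the norm constraint
    return recon + lam * sparsity + decay

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
Phi = rng.normal(size=(4, 8))
A = rng.normal(size=(8, 10))
val = J(X, Phi, A, lam=0.1, eps=1e-6, gamma=0.01)
```

Near zero the surrogate behaves like $\sqrt{\epsilon}$ (so it is smooth where $|x|$ has a kink), while far from zero it closely tracks $|x|$.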
• The set of basis vectors is called the "dictionary" ($D$). $D$ is "adapted" to $x$ if it can represent $x$ with only a few basis vectors, that is, if there exists a sparse vector $\alpha \in \mathbb{R}^p$ such that $x=D\alpha=\sum_{i=1}^p \alpha_i d_i$. We call $\alpha$ the sparse code.
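A tiny numerical illustration of $x = D\alpha$ with a sparse code (the dictionary and coefficient values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 20
D = rng.normal(size=(n, p))           # dictionary: p atoms (columns) in R^n, p > n
alpha = np.zeros(p)
alpha[[2, 7, 13]] = [1.5, -0.8, 0.3]  # only 3 of the 20 coefficients are nonzero
x = D @ alpha                         # x is a combination of just 3 atoms
```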

### Learning

• Learning a set of basis vectors $\phi$ using sparse coding consists of performing two separate optimizations (i.e., an alternating optimization scheme):
• The first is an optimization over the coefficients $a_i$ for each training example $x$;
• The second is an optimization over the basis vectors $\phi$ across many training examples at once.
• However, while this classical scheme of alternating between $D$ and $\alpha$ can achieve good results, it is very slow.
• A significant limitation of sparse coding is that even after a set of basis vectors has been learnt, in order to "encode" a new data example, an optimization must be performed to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared to typical feed-forward architectures.
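The alternating scheme above can be sketched with plain gradient descent on the smoothed objective. This is a minimal illustration, not a tuned implementation: the step sizes, iteration counts, and initialization are assumptions.

```python
import numpy as np

def alternating_sparse_coding(X, k, lam=0.1, eps=1e-6, gamma=0.01,
                              n_outer=30, n_inner=10, lr=0.01):
    """Alternate gradient steps on coefficients A and basis Phi.

    X: (n, m) matrix of training examples. Returns learnt (Phi, A).
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(n, k))
    A = np.zeros((k, m))
    for _ in range(n_outer):
        # Step 1: optimize the coefficients A with Phi held fixed.
        for _ in range(n_inner):
            R = X - Phi @ A                                    # residual
            grad_A = -2 * Phi.T @ R + lam * A / np.sqrt(A**2 + eps)
            A -= lr * grad_A
        # Step 2: optimize the basis Phi with A held fixed.
        R = X - Phi @ A
        grad_Phi = -2 * R @ A.T + 2 * gamma * Phi              # includes weight decay
        Phi -= lr * grad_Phi
    return Phi, A

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 6))
Phi, A = alternating_sparse_coding(X, k=8)
```

Note that encoding a new example at test time requires re-running the inner loop (Step 1) for that example, which is exactly the runtime cost mentioned above.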

### Remarks

• In my view, because sparseness is enforced during dictionary learning (i.e., in the sparse code), the restored matrix tends to suppress the noise in the original matrix, i.e., it has a denoising effect. Hence, sparse coding could be used to denoise images.
