[Convex Optimization Algorithms] Two Views of Mirror Descent
Motivation
Mirror Descent is an extension of Gradient Descent, so its motivation comes from the limitations of Gradient Descent:
[*]Gradient Descent assumes that $f$ is $L$-Lipschitz with respect to the 2-norm, and its convergence rate depends on $L$. If $f$ is Lipschitz with respect to some other norm, plain Gradient Descent may no longer be the right tool. For example, if the gradient of $f$ is bounded in the $\infty$-norm, $\|\nabla f(x)\|_\infty \le L$, then the best available 2-norm bound is $\|\nabla f(x)\|_2 \le \sqrt{n}\,L$, so $f$ is only $\sqrt{n}L$-Lipschitz with respect to the 2-norm. When $n$ is very large, Gradient Descent converges slowly.
[*]The Gradient Descent update is $x_{t+1} = x_t - \eta_t \nabla f(x_t)$. From a more abstract viewpoint, $x_t$ lives in the primal space while $\nabla f(x_t)$ lives in the dual space, so Gradient Descent directly takes a linear combination of a primal vector and a dual vector. In a primal space equipped with the 2-norm (the usual setting) this is fine, because the 2-norm is self-dual; under other norms it is no longer so reasonable.
[*]There is also an even simpler motivation; see the Proximal Point View below.
Proximal Point View
The Gradient Descent update $x^{t+1} = x^t - \eta_t \nabla f(x^t)$ is equivalent to:
$$x^{t+1} = \arg\min_{x} \Big( f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{2\eta_t}\|x - x^t\|^2 \Big)$$
Here $f(x^t) + \langle \nabla f(x^t), x - x^t \rangle$ is the linear approximation of $f$ at $x^t$. We want $f(x^{t+1})$ to be small, so we want this linear approximation to be as small as possible. At the same time, we want $x^{t+1}$ to stay close to $x^t$, since otherwise the linear approximation is inaccurate. This motivates a regularizer that pulls $x$ back toward $x^t$, which is exactly the role of the proximity term $\frac{1}{2\eta_t}\|x - x^t\|^2$.
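To confirm the claimed equivalence: the objective above is a strictly convex quadratic in $x$, so setting its gradient to zero recovers the usual update:
$$\nabla f(x^t) + \frac{1}{\eta_t}(x^{t+1} - x^t) = 0 \quad\Longrightarrow\quad x^{t+1} = x^t - \eta_t \nabla f(x^t).$$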
The proximity term introduced above is built on the 2-norm, reflecting Euclidean geometry. To handle non-Euclidean spaces and other norms, we need a further generalization. Consider two examples:
Example 1. Quadratic Minimization
Consider a QP problem with a very large condition number:
$$\min_x \ \frac{1}{2}(x - x^*)^T Q (x - x^*), \quad \text{with } \kappa = \frac{\max_i Q_{i,i}}{\min_i Q_{i,i}} \gg 1$$
Here $Q$ (think of it as diagonal) scales the coordinate directions very differently. If we replace the proximity term with $\frac{1}{2\eta}(x - x^t)^T Q (x - x^t)$, i.e. rescale the coordinate axes, the method becomes much better suited to this problem.
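Concretely (a one-line computation not spelled out in the original note), minimizing the linear approximation plus this rescaled proximity term yields a preconditioned step that undoes the bad scaling of $Q$:
$$x^{t+1} = x^t - \eta\, Q^{-1} \nabla f(x^t).$$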
Example 2. Probability Simplex.
Consider minimization over the probability simplex:
$$\min_{x \in \triangle} f(x), \quad \text{with } \triangle := \{ x \in \mathbb{R}^n_+ : \mathbf{1}^T x = 1 \}$$
If we generalize the proximity term to the KL divergence, it captures the distance between probability distributions much more faithfully.
Mirror Descent replaces $\frac{1}{2}\|x - x^t\|^2$ with a more general term $D_{\varphi}(x, x^t)$, where $D_{\varphi}$ is a Bregman Divergence (its definition and properties are discussed below).
The MD iteration then becomes:
$$x^{t+1} = \arg\min_{x} \Big( f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} D_{\varphi}(x, x^t) \Big)$$
Bregman Divergence
Definition. $D_{\varphi}$ is generated by a strictly convex, differentiable function $\varphi$:
$$D_{\varphi}(x, z) := \varphi(x) - \varphi(z) - \langle \nabla \varphi(z), x - z \rangle$$
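The definition translates directly into code. Below is a minimal sketch (the helper `bregman`, the generators, and the toy vectors are my additions, not from the original note), instantiated with the two generators that will matter later: the squared 2-norm and the negative entropy.

```python
import numpy as np

def bregman(phi, grad_phi, x, z):
    """Bregman divergence D_phi(x, z) = phi(x) - phi(z) - <grad_phi(z), x - z>."""
    return phi(x) - phi(z) - np.dot(grad_phi(z), x - z)

# Generator 1: phi(x) = 1/2 ||x||^2  ->  D(x, z) = 1/2 ||x - z||^2.
sq_norm = lambda x: 0.5 * np.dot(x, x)
grad_sq_norm = lambda x: x

# Generator 2: negative entropy phi(x) = sum_i x_i log x_i (positive orthant).
neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])  # toy points on the simplex (my choice)
z = np.array([0.4, 0.4, 0.2])
print(bregman(sq_norm, grad_sq_norm, x, z))          # equals 0.5 * ||x - z||^2
print(bregman(neg_entropy, grad_neg_entropy, x, z))  # equals KL(x || z) on the simplex
```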
Since we are using the Bregman Divergence to measure the distance between $x$ and $x^t$, three questions need answering:
1. Can the Bregman Divergence actually serve as a distance? For example, does it equal 0 when $x = x^t$?
2. Is the Bregman Divergence convex? If not, then $\arg\min_{x} \big( f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} D_{\varphi}(x, x^t) \big)$ is not a convex optimization problem.
3. Since the Bregman Divergence is a generalization, can it recover the concrete cases above: Gradient Descent, Example 1, and Example 2?
For the first question: since $\varphi$ is strictly convex, $D_{\varphi}(x, z) = \varphi(x) - \varphi(z) - \langle \nabla \varphi(z), x - z \rangle \ge 0$, with equality exactly at $x = z$.
For the second question: with $z$ fixed, $D_{\varphi}(x, z)$ is convex in $x$, since it is $\varphi(x)$ plus an affine function of $x$. Moreover, $\langle \nabla f(x^t), x - x^t \rangle$ is convex because it is linear. Hence $f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} D_{\varphi}(x, x^t)$ is a convex function (for $\eta_t > 0$).
For the third question:
Taking $\varphi(x) = \frac{1}{2}\|x\|^2$ gives $D_{\varphi}(x, x^t) = \frac{1}{2}\|x - x^t\|^2$, which recovers Gradient Descent.
Taking $\varphi(x) = \frac{1}{2} x^T Q x$ gives $D_{\varphi}(x, x^t) = \frac{1}{2}(x - x^t)^T Q (x - x^t)$, which recovers Example 1.
Taking $\varphi(x) = \sum_{i=1}^n x_i \log x_i$ (the negative entropy) gives $D_{\varphi}(x, x^t) = \sum_{i=1}^n x_i \log \frac{x_i}{x^t_i}$, the KL divergence, which recovers Example 2.
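The last identity deserves one line of algebra (filled in here; the original note does not spell it out). With $\nabla \varphi(x^t)_i = 1 + \log x^t_i$:
$$D_{\varphi}(x, x^t) = \sum_i x_i \log x_i - \sum_i x^t_i \log x^t_i - \sum_i (1 + \log x^t_i)(x_i - x^t_i) = \sum_i x_i \log \frac{x_i}{x^t_i},$$
where the last step uses $\sum_i x_i = \sum_i x^t_i = 1$ on the simplex.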
Based on the discussion above, when $f$ is differentiable everywhere, Mirror Descent reads:
$$x^{t+1} \gets \arg\min_{x \in K} \big( \eta_t \langle \nabla f(x^t), x \rangle + D_{\varphi}(x, x^t) \big) \tag{1}$$
Wherever $f$ is not differentiable, replace the gradient with a subgradient.
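For Example 2, update (1) with the negative-entropy generator has a well-known closed form, the exponentiated-gradient step $x^{t+1}_i \propto x^t_i \exp(-\eta_t \nabla_i f(x^t))$. Here is a minimal runnable sketch; the linear toy objective $f(x) = \langle c, x \rangle$ and all names are my own choices, not from the original note:

```python
import numpy as np

def md_simplex_step(x, grad, eta):
    """One step of update (1) on the probability simplex with the
    negative-entropy generator; the argmin has the closed form
    x_i <- x_i * exp(-eta * grad_i), renormalized."""
    w = x * np.exp(-eta * grad)
    return w / w.sum()

# Toy problem (my choice): minimize f(x) = <c, x> over the simplex.
c = np.array([3.0, 1.0, 2.0])   # gradient of f is the constant vector c
x = np.ones(3) / 3              # start from the uniform distribution
for t in range(200):
    x = md_simplex_step(x, c, eta=0.1)
print(x)  # mass concentrates on argmin_i c_i (here, the second coordinate)
```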
Mirror Space View
The Motivation section mentioned primal and dual spaces. Conventionally, the iterate $x^t$ lives in the primal space, while the gradient $\nabla f(x^t)$ lives in the dual space.
Nemirovski and Yudin's idea is:
[*]Map $x^t$ from the primal space to the dual space through a mirror map, obtaining $\theta^t$.
[*]Perform gradient descent in the dual space: $\theta^{t+1} = \theta^t - \eta_t \nabla f(x^t)$.
[*]Map $\theta^{t+1}$ back to the primal space, obtaining $\bar{x}^{t+1}$.
[*]If the optimization problem has a constraint set, project $\bar{x}^{t+1}$ onto it, obtaining $x^{t+1}$.
So what is this mirror map? It is simply the gradient $\nabla \varphi$ of the function $\varphi$ that generates the Bregman Divergence in the previous section.
Written out as algorithm (2), Nemirovski and Yudin's idea is as follows (a worked instance follows the list):
[*]Map to dual space: $\theta^t \gets \nabla \varphi(x^t)$
[*]Gradient step in dual space: $\theta^{t+1} \gets \theta^t - \eta_t \nabla f(x^t)$
[*]Map back to primal space: $\bar{x}^{t+1} \gets (\nabla \varphi)^{-1}(\theta^{t+1})$
[*]Project onto the constraint set: $x^{t+1} \gets \arg\min_{x \in K} D_{\varphi}(x, \bar{x}^{t+1})$
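A minimal sketch of these four steps for the entropy mirror map $\nabla \varphi(x) = \mathbf{1} + \log x$, where the Bregman (KL) projection onto the simplex reduces to $\ell_1$ normalization; the toy gradient values are my own. It reproduces the closed-form step of update (1), previewing the equivalence proved below:

```python
import numpy as np

def mirror_space_step(x, grad, eta):
    """Algorithm (2) with phi(x) = sum_i x_i log x_i on the simplex."""
    theta = np.log(x) + 1.0      # 1) map to dual space: theta = grad phi(x)
    theta = theta - eta * grad   # 2) gradient step in dual space
    x_bar = np.exp(theta - 1.0)  # 3) map back: (grad phi)^{-1}(theta)
    return x_bar / x_bar.sum()   # 4) KL projection onto the simplex = normalization

x = np.array([0.2, 0.3, 0.5])
g = np.array([3.0, 1.0, 2.0])    # stand-in gradient values (my choice)
lhs = mirror_space_step(x, g, eta=0.1)
w = x * np.exp(-0.1 * g)         # the closed form of update (1)
print(np.allclose(lhs, w / w.sum()))  # True: both views give the same iterate
```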
Two points about this algorithm are worth discussing:
[*]Is the map $\nabla \varphi$ in step 3 necessarily invertible? In other words, can $\bar{x}^{t+1}$ actually be computed?
[*]Are algorithm (2) and algorithm (1) from the Proximal Point View the same algorithm?
For the first question: for a general convex $\varphi$, the inverse may well fail to exist, so we require $\nabla \varphi$ to be a bijection. That is, $\varphi$ must satisfy the following two conditions:
(i) $\varphi$ is strictly convex on the constraint set $K$ and differentiable everywhere.
(ii) The image of $K$ under $\nabla \varphi$ is all of $\mathbb{R}^n$, i.e. $\{ \nabla \varphi(x) : x \in K \} = \mathbb{R}^n$.
Remark. (i) guarantees that $\nabla \varphi$ is injective; (ii) guarantees that it is surjective, so the inverse in step 3 can always be solved for.
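As a sanity check (an example I am adding; it is not in the original note), the negative entropy $\varphi(x) = \sum_i x_i \log x_i$ on $K = \mathbb{R}^n_{>0}$ satisfies both conditions:
$$\nabla \varphi(x)_i = 1 + \log x_i, \qquad \{ \nabla \varphi(x) : x \in \mathbb{R}^n_{>0} \} = \mathbb{R}^n, \qquad (\nabla \varphi)^{-1}(\theta)_i = e^{\theta_i - 1}.$$
Strict convexity holds because $\nabla^2 \varphi(x) = \mathrm{diag}(1/x_1, \dots, 1/x_n) \succ 0$.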
For the second question, we derive algorithm (1) from algorithm (2):
$$\begin{aligned}
x^{t+1} &= \arg\min_{x \in K} \ D_{\varphi}(x, \bar{x}^{t+1}) \\
&= \arg\min_{x \in K} \ \varphi(x) - \varphi(\bar{x}^{t+1}) - \langle \nabla \varphi(\bar{x}^{t+1}), x - \bar{x}^{t+1} \rangle && \text{(definition of Bregman Divergence)} \\
&= \arg\min_{x \in K} \ \varphi(x) - \langle \nabla \varphi(\bar{x}^{t+1}), x \rangle && \text{(dropping terms constant in } x\text{)} \\
&= \arg\min_{x \in K} \ \varphi(x) - \langle \nabla \varphi(x^t) - \eta_t \nabla f(x^t), x \rangle && (\nabla \varphi(\bar{x}^{t+1}) = \theta^{t+1} = \nabla \varphi(x^t) - \eta_t \nabla f(x^t)) \\
&= \arg\min_{x \in K} \ \langle \eta_t \nabla f(x^t), x \rangle + \varphi(x) - \langle \nabla \varphi(x^t), x \rangle \\
&= \arg\min_{x \in K} \ \langle \eta_t \nabla f(x^t), x \rangle + \varphi(x) - \varphi(x^t) - \langle \nabla \varphi(x^t), x - x^t \rangle && \text{(adding terms constant in } x\text{)} \\
&= \arg\min_{x \in K} \ \langle \eta_t \nabla f(x^t), x \rangle + D_{\varphi}(x, x^t)
\end{aligned}$$
Since every step above is an equivalence, the chain can be run in reverse, deriving (2) from (1) as well.
Thus the conclusions of the Proximal Point View and the Mirror Space View are identical.
Summary of the Reading Notes
These reading notes discussed Mirror Descent from two perspectives: the Proximal Point View and the Mirror Space View.
The convergence analysis of Mirror Descent mainly uses the three-point lemma and the (Bregman) Pythagorean theorem, leading to a convergence result in the online optimization setting. The proof is fairly involved, so I will not write it out in these notes.
Finally, a personal remark. The Mirror Space View is hard for me to grasp. Limited by my own background, notions such as the dual norm, conjugate functions, and linear functionals still feel very abstract to me, even though I can follow each formula. I feel I still cannot look at these problems from a higher vantage point, so I will keep thinking and keep working through the books. A question for the experts: can the Mirror Space view guide the design of new algorithms, or does it exist purely for the sake of mathematical proofs?