ainatipen 发表于 2022-5-25 08:56

《最优化:建模、算法与理论》笔记|2.1范数,2.2导数

2.1范数

定义(范数)称一个从向量空间到实数域 https://www.zhihu.com/equation?tex=%5Cmathbb+%7BR%7D 的非负函数为范数(在度量空间中也被称为度量),如果它满足:
(1)正定性:对于所有的,有 https://www.zhihu.com/equation?tex=%5C%7Cv%5C%7C%5Cgeq0 ,且 https://www.zhihu.com/equation?tex=%5C%7Cv%5C%7C%3D0 当且仅当 https://www.zhihu.com/equation?tex=v%3D0 ;
(2)齐次性:对于所有的和 https://www.zhihu.com/equation?tex=%5Calpha%5Cin%5Cmathbb+%7BR%7D ,有 https://www.zhihu.com/equation?tex=%5C%7C%5Calpha%7Bv%7D%5C%7C%3D%5Cleft%7C+%5Calpha+%5Cright%7C%5C%7Cv%5C%7C ;
(3)三角不等式:对于所有的 https://www.zhihu.com/equation?tex=v%2Cw%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D ,有 https://www.zhihu.com/equation?tex=%5C%7Cv%2Bw%5C%7C%5Cleq%5C%7Cv%5C%7C%2B%5C%7Cw%5C%7C 。向量范数

1)最常用的向量范数为 ( https://www.zhihu.com/equation?tex=p%5Cgeq1 )范数 :
https://www.zhihu.com/equation?tex=%5C%7Cv%5C%7C_p%3D%28%7Cv_1%7C%5E%7Bp%7D%2B%7Cv_2%7C%5E%7Bp%7D%2B%C2%B7%C2%B7%C2%B7%2B%7Cv_n%7C%5E%7Bp%7D%29%5E%7B%5Cfrac%7B1%7D%7Bp%7D%7D%5C%5C

https://www.zhihu.com/equation?tex=p%3D1%2C2%2C%5Cinfty 为最常遇见的情形,分别记为 https://www.zhihu.com/equation?tex=%5C%7C%C2%B7%5C%7C_1%2C%5C%7C%C2%B7%5C%7C_2 和 https://www.zhihu.com/equation?tex=%5C%7C%C2%B7%5C%7C_%5Cinfty ,其中, https://www.zhihu.com/equation?tex=%5C%7Cv%5C%7C_%5Cinfty%3D%5Cmathop%7Bmax%7D%5Climits_%7Bi%7D%7Cv_%7Bi%7D%7C .
而在不引起歧义的情况下,会省去范数的角标,记为.
对于范数,有如下命题成立:
命题(Cauchy不等式)设 https://www.zhihu.com/equation?tex=a%2Cb%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D ,则有 https://www.zhihu.com/equation?tex=%7Ca%5E%7BT%7Db%7C%5Cleq%5C%7Ca%5C%7C_2%5C%7Cb%5C%7C_2 成立,等号成立当且仅当 https://www.zhihu.com/equation?tex=a 与 https://www.zhihu.com/equation?tex=b 线性相关。
2)由正定矩阵A诱导的范数:
https://www.zhihu.com/equation?tex=%5C%7Cx%5C%7C_A%5Cxlongequal%7Bdef%7D%5Csqrt%7Bx%5E%7BT%7DAx%7D%5C%5C
矩阵范数

1)由向量 范数推广得到的矩阵 范数:
当 https://www.zhihu.com/equation?tex=p%3D1 时,矩阵 https://www.zhihu.com/equation?tex=a%5Cin%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bn%7D%7D 的 https://www.zhihu.com/equation?tex=l_1 范数定义为 https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_1%3D%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%7B%5Csum_%7Bj%3D1%7D%5E%7Bn%7D%7B%7Ca_%7Bij%7D%7C%7D%7D ,即中所有元素的绝对值之和;
当 https://www.zhihu.com/equation?tex=p%3D2 时,得到矩阵的Frobenius范数(简称为F范数),记为 https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_F ,其满足 https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_F%3D%5Csqrt%7Btr%28AA%5ET%29%7D%3D%5Csqrt%7B%5Csum_%7Bi%2Cj%7D%7Ba_%7Bij%7D%5E%7B2%7D%7D%7D%5C%5C
F范数还具有正交不变性,即对于任意的正交矩阵 https://www.zhihu.com/equation?tex=U%5Cin%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bm%7D%7D%2CV%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%5Ctimes%7Bn%7D%7D ,有 https://www.zhihu.com/equation?tex=%5C%7CUAF%5C%7C%5E%7B2%7D_F%3D%5C%7CA%5C%7C%5E%7B2%7D_F 成立(根据F范数的定义得出)
2)由向量范数诱导而来的算子范数(矩阵范数的一种):
先给出算子范数的相容性的定义:设是上的向量范数,是 https://www.zhihu.com/equation?tex=%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bn%7D%7D 上的矩阵范数,且有
https://www.zhihu.com/equation?tex=%5C%7CAx%5C%7C_%7B%28m%29%7D%5Cleq%5C%7CA%5C%7C_%7B%28m%2Cn%29%7D%5C%7Cx%5C%7C_%7B%28n%29%7D%5C%5C
成立,则称为与向量范数相容的矩阵范数。但并非所有的矩阵范数都与给定的向量范数相容,只有满足该定义的矩阵范数才可以说与给定的向量范数相容。
然后给出算子范数的定义:
给定矩阵,及 https://www.zhihu.com/equation?tex=m 维和 https://www.zhihu.com/equation?tex=n 维空间的向量范数和,则
https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_%7B%28m%2Cn%29%7D%3D%5Cmathop%7Bmax%7D%5Climits_%7Bx%5Cne0%7D%5Cfrac%7B%5C%7CAx%5C%7C_%7B%28m%29%7D%7D%7B%5C%7Cx%5C%7C_%7B%28n%29%7D%5C%5C%7D%3D%5Cmathop%7Bmax%7D%5Climits_%7Bx%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D%2C%5C%7Cx%5C%7C_%7B%28n%29%7D%3D1%7D%5C%7CAx%5C%7C_%7B%28m%29%7D%5C%5C
是与向量范数相容的矩阵范数,我们称该矩阵范数为由向量范数诱导出的算子范数。
如果将和都取为相应向量空间的范数,则可得到矩阵的 https://www.zhihu.com/equation?tex=p 范数,如矩阵的 https://www.zhihu.com/equation?tex=2 范数为
https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_%7B2%7D%3D%5Cmathop%7Bmax%7D%5Climits_%7Bx%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D%2C%5C%7Cx%5C%7C_%7B2%7D%3D1%7D%5C%7CAx%5C%7C_%7B2%7D%5C%5C
也就是该矩阵的最大奇异值.(因为 https://www.zhihu.com/equation?tex=%5C%7CAx%5C%7C_%7B2%7D%5E2%3D%7B%28Ax%29%7D%5ETAx%3Dx%5ET%5Clambda%7Bx%7D%3D%5Clambda%5C%7Cx%5C%7C_%7B2%7D%5E2 ,所以 https://www.zhihu.com/equation?tex=%5Clambda%3D%5Cfrac%7B%5C%7CAx%5C%7C_%7B2%7D%5E2%7D%7B%5C%7Cx%5C%7C_%7B2%7D%5E2%7D%3D%5Csigma%5E%7B2%7D )
3)核范数:
给定矩阵,其核范数定义为 https://www.zhihu.com/equation?tex=%5C%7CA%5C%7C_%7B%2A%7D%3D%5Csum_%7Bi%3D1%7D%5E%7Br%7D%7B%5Csigma_i%7D ,其中 https://www.zhihu.com/equation?tex=r%3Drank%28A%29 , https://www.zhihu.com/equation?tex=%5Csigma_i 为的所有非零奇异值。我们常常通过限制矩阵的核范数(因为核范数能凸近似矩阵的秩)来保证矩阵的低秩性,矩阵的低秩性表明该矩阵的各行或各列存在相似性,因此易利用相似性来对缺失数据进行恢复。
矩阵内积

范数一般用来衡量矩阵模的大小,而内积一般用来表征两个矩阵(或其张成的空间)之间的夹角。
这里,我们只介绍一种常用的内积——Frobenius内积: 的矩阵和的Frobenius内积定义为
https://www.zhihu.com/equation?tex=%5Cleft+%5Clangle+A%2CB+%5Cright%5Crangle%5Cxlongequal%7Bdef%7Dtr%28AB%5ET%29%3D%5Csum_%7Bi%3D1%7D%5E%7Bm%7D%7B%5Csum_%7Bj%3D1%7D%5E%7Bn%7D%7Ba_%7Bij%7Db_%7Bij%7D%7D%7D%5C%5C
即两个矩阵逐分量相乘的和,当 https://www.zhihu.com/equation?tex=A%3DB 时, https://www.zhihu.com/equation?tex=%5Cleft+%5Clangle+A%2CB+%5Cright%5Crangle 等于矩阵的F范数的平方。
我们还有矩阵范数对应的Cauchy不等式,如下:
命题(矩阵范数的Cauchy不等式)设 https://www.zhihu.com/equation?tex=A%2CB%5Cin%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bn%7D%7D ,则有 https://www.zhihu.com/equation?tex=%7C%5Cleft+%5Clangle+A%2CB+%5Cright%5Crangle%7C%5Cleq%5C%7CA%5C%7C_F%5C%7CB%5C%7C_F 成立,其中,等号成立当且仅当和线性相关。
2.2导数

我们可以利用函数值和导数信息,来构造容易求解的且具有很好的逼近原问题效果的子问题,从而构造各种有效的算法。
多元可微函数

我们先给出梯度和Hessian矩阵的定义:
定义(梯度)给定函数,且在点的一个邻域内有意义,若存在向量 https://www.zhihu.com/equation?tex=g%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D 满足
https://www.zhihu.com/equation?tex=%5Clim_%7Bp+%5Crightarrow+0%7D%7B%5Cfrac%7Bf%28x%2Bp%29-f%28x%29-g%5ETp%7D%7B%5C%7Cp%5C%7C%7D%7D%3D0%5C%5C
其中,是任意的向量范数,就称在点处可微(或Frechet可微)。此时, https://www.zhihu.com/equation?tex=g 称为在点处的梯度,记作。如果对区域上的每一个点都有存在,则称在上可微。
可知, https://www.zhihu.com/equation?tex=%5Cnabla%7Bf%28x%29%7D%3D%5Cleft%5B+%5Cfrac%7B%5Cpartial+f%28x%29%7D%7B%5Cpartial+x_1%7D%2C+%5Cfrac%7B%5Cpartial+f%28x%29%7D%7B%5Cpartial+x_2%7D%2C%C2%B7%C2%B7%C2%B7%2C%5Cfrac%7B%5Cpartial+f%28x%29%7D%7B%5Cpartial+x_n%7D%5Cright%5D%5ET
如果我们只关心对一部分变量的梯度,那么可以在 https://www.zhihu.com/equation?tex=%5Cnabla 下加下标,如 https://www.zhihu.com/equation?tex=%5Cnabla_xf%28x%2Cy%29 表示当把视为常数时,关于的梯度。
定义(Hessian矩阵)如果函数在点处的二阶偏导数 https://www.zhihu.com/equation?tex=+%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_i%5Cpartial+x_j%7D+%2Ci%2Cj%3D1%2C2%2C%C2%B7%C2%B7%C2%B7%2Cn 都存在,则
https://www.zhihu.com/equation?tex=+%5Cnabla%5E2%7Bf%28x%29%7D%3D+%5Cbegin%7Bbmatrix%7D+%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x%5E%7B2%7D_1%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_1%5Cpartial+x_2%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_1%5Cpartial+x_3%7D++%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_1%5Cpartial+x_n%7D+%5C%5C+%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_2%5Cpartial+x_1%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x%5E%7B2%7D_2%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_2%5Cpartial+x_3%7D++%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_2%5Cpartial+x_n%7D+%5C%5C+%5Cvdots+%26%5Cvdots%26%5Cvdots%26%26%5Cvdots%5C%5C+%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_n%5Cpartial+x_1%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_n%5Cpartial+x_2%7D++%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x_n%5Cpartial+x_3%7D++%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+%5E%7B2%7Df%28x%29%7D%7B%5Cpartial+x%5E%7B2%7D_n%7D+%5C%5C+%5Cend%7Bbmatrix%7D%5C%5C
称为在点处的Hessian矩阵。
当在区域上的每个点处都存在时,称在上二阶可微;当在上的每个点都存在且在上连续时,则称在上二阶连续可微,此时,Hessian矩阵是一个对称矩阵。
然后给出如下形式的泰勒展开:
定理(多元函数的泰勒展开)设是连续可微的, https://www.zhihu.com/equation?tex=p%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D 为向量,那么
https://www.zhihu.com/equation?tex=f%28x%2Bp%29%3Df%28x%29%2B%5Cnabla%7Bf%28x%2Btp%29%5ETp%7D%5C%5C
其中,。进一步地,如果是二阶连续可微的,则
https://www.zhihu.com/equation?tex=%5Cnabla%7Bf%28x%2Bp%29%7D%3D%5Cnabla%7Bf%28x%29%7D%2B%5Cint_%7B0%7D%5E%7B1%7D%5Cnabla%5E%7B2%7D%7Bf%28x%2Btp%29p+dt%7D%5C%5C+f%28x%2Bp%29%3Df%28x%29%2B%5Cnabla%7Bf%28x%29%5ETp%7D%2B%5Cfrac%7B1%7D%7B2%7Dp%5E%7BT%7D%5Cnabla%5E%7B2%7D%7Bf%28x%2Btp%29p%7D%5C%5C
其中,。
最后,我们将介绍一种特殊的可微函数——梯度利普希茨(Lipschitz)连续的函数,该类函数在很多优化算法的收敛性证明中起着关键作用。
定义(梯度Lipschitz连续)给定可微函数,若存在 https://www.zhihu.com/equation?tex=L%3E0 ,对任意的 https://www.zhihu.com/equation?tex=x%2Cy%5Cin%7Bdom%7Df (函数的定义域)有 https://www.zhihu.com/equation?tex=%5C%7C%5Cnabla%7Bf%28x%29%7D-%5Cnabla%7Bf%28y%29%7D%5C%7C%5Cleq%7BL%7D%5C%7Cx-y%5C%7C 成立,则称是梯度利普希茨(Lipschitz)连续的,为利普希茨常数。有时也简记为梯度 -利普希茨连续或 -光滑。
梯度Lipschitz连续表明的变化受的变化的控制。满足梯度Lipschitz连续的函数具有很多很好的性质,如下面的二次上界:
引理(二次上界)设可微函数的定义域 https://www.zhihu.com/equation?tex=domf%3D%5Cmathbb+%7BR%7D%5E%7Bn%7D ,且为梯度Lipschitz连续的,则函数有二次上界:
https://www.zhihu.com/equation?tex=f%28y%29%5Cleq%7Bf%28x%29%7D%2B%5Cnabla%7Bf%28x%29%7D%5ET%28y-x%29%2B%5Cfrac%7BL%7D%7B2%7D%5C%7Cy-x%5C%7C%5E2%2C%5Cforall%7Bx%2Cy%5Cin%7Bdomf%7D%7D%5C%5C
证明. 构造辅助函数 https://www.zhihu.com/equation?tex=g%28t%29%3Df%28%281-t%29x%2Bty%29%2Ct%5Cin%5Cleft%5B+0%2C1+%5Cright%5D ,则有 https://www.zhihu.com/equation?tex=g%5E%7B%27%7D%28t%29%3D%5Cnabla%7Bf%28x%2Bt%28y-x%29%29%5ET%28y-x%29%7D 成立,所以
https://www.zhihu.com/equation?tex=%5Cbegin%7Balign%7D+%26%5Cquad+f%28y%29-%7Bf%28x%29%7D-%5Cnabla%7Bf%28x%29%7D%5ET%28y-x%29%5C%5C++%26%3Dg%281%29-g%280%29-g%5E%7B%27%7D%280%29%5C%5C+%26%3D%5Cint_%7B0%7D%5E%7B1%7Dg%5E%7B%27%7D%28t%29dt-g%5E%7B%27%7D%280%29%5C%5C+%26%3D%5Cint_%7B0%7D%5E%7B1%7D%28g%5E%7B%27%7D%28t%29-g%5E%7B%27%7D%280%29%29%5C%5C+%26%3D%5Cint_%7B0%7D%5E%7B1%7D%28%5Cnabla%7Bf%28x%2Bt%28y-x%29%29-%5Cnabla%7Bf%28x%29%7D%29%5ET%28y-x%29%7Ddt%5C%5C+%26%5Cleq%5Cint_%7B0%7D%5E%7B1%7D%5C%7C%5Cnabla%7Bf%28x%2Bt%28y-x%29%29-%5Cnabla%7Bf%28x%29%7D%5C%7C%5C%7Cy-x%5C%7C%7Ddt%5C%5C+%26%5Cleq%5Cint_%7B0%7D%5E%7B1%7DL%5C%7Cy-x%5C%7C%5E%7B2%7Dtdt%5C%5C+%26%3D%5Cfrac%7BL%7D%7B2%7D%5C%7Cy-x%5C%7C%5E2++%5Cend%7Balign%7D%5C%5C+
该引理表明可以被一个二次函数上界所控制,即要求的增长速度不超过二次。
当对该引理中的定义域的要求减弱为凸集时,该引理也成立,因为定义域是凸集可以保证 https://www.zhihu.com/equation?tex=g%28t%29 当 https://www.zhihu.com/equation?tex=t%5Cin%5Cleft%5B+0%2C1+%5Cright%5D 时是有定义的。
如果是梯度Lipschitz连续的,且有一个全局极小点,那么我们可以利用二次上界来估计 https://www.zhihu.com/equation?tex=f%28x%29-f%28x%5E%7B%2A%7D%29 的大小:
推论 设可微函数的定义域为,且存在一个全局极小点,若为梯度Lipschitz连续的,则对任意的有 https://www.zhihu.com/equation?tex=%5Cfrac%7B1%7D%7B2L%7D%5C%7C%5Cnabla%7Bf%28x%29%7D%5C%7C%5E2%5Cleq%7Bf%28x%29-f%28x%5E%2A%29%7D 成立。
证明. 固定,对任意的应用二次上界,得https://www.zhihu.com/equation?tex=f%28x%5E%2A%29%5Cleq%7Bf%28y%29%7D%5Cleq%7Bf%28x%29%7D%2B%5Cnabla%7Bf%28x%29%5ET%28y-x%29%2B%5Cfrac%7BL%7D%7B2%7D%5C%7Cy-x%5C%7C%5E2%7D
那么我们有 https://www.zhihu.com/equation?tex=f%28x%5E%2A%29%5Cleq+%5Cinf%5Climits_%7By%5Cin%5Cmathbb+%7BR%7D%5E%7Bn%7D%7D%5Cleft%5C%7B+%7Bf%28x%29%7D%2B%5Cnabla%7Bf%28x%29%5ET%28y-x%29%2B%5Cfrac%7BL%7D%7B2%7D%5C%7Cy-x%5C%7C%5E2%7D+%5Cright%5C%7D%3Df%28x%29-%5Cfrac%7B1%7D%7B2L%7D%5C%7C%5Cnabla%7Bf%28x%29%7D%5C%7C%5E2 成立,最后一步应用了二次函数的性质:当 https://www.zhihu.com/equation?tex=y%3Dx-%5Cfrac%7B%5Cnabla%7Bf%28x%29%7D%7D%7BL%7D 时取到最小值。
向量值函数

当 https://www.zhihu.com/equation?tex=f%3A%5Cmathbb+%7BR%7D%5E%7Bn%7D%5Crightarrow%5Cmathbb+%7BR%7D%5E%7Bm%7D 是向量值函数时,我们可以定义它的Jacobi矩阵 https://www.zhihu.com/equation?tex=J%28x%29%5Cin%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bn%7D%7D ,它的第 https://www.zhihu.com/equation?tex=i 行是分量 https://www.zhihu.com/equation?tex=f_i%28x%29 的梯度的转置,即
https://www.zhihu.com/equation?tex=J%28x%29%3D+%5Cbegin%7Bbmatrix%7D+%5Cfrac%7B%5Cpartial+f_1%28x%29%7D%7B%5Cpartial+x_1%7D+%26%5Cfrac%7B%5Cpartial+f_1%28x%29%7D%7B%5Cpartial+x_2%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f_1%28x%29%7D%7B%5Cpartial+x_n%7D%5C%5C+%5Cfrac%7B%5Cpartial+f_2%28x%29%7D%7B%5Cpartial+x_1%7D+%26%5Cfrac%7B%5Cpartial+f_2%28x%29%7D%7B%5Cpartial+x_2%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f_2%28x%29%7D%7B%5Cpartial+x_n%7D%5C%5C+%5Cvdots+%26%5Cvdots%26%26%5Cvdots%5C%5C+%5Cfrac%7B%5Cpartial+f_m%28x%29%7D%7B%5Cpartial+x_1%7D+%26%5Cfrac%7B%5Cpartial+f_m%28x%29%7D%7B%5Cpartial+x_2%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f_m%28x%29%7D%7B%5Cpartial+x_n%7D%5C%5C+%5Cend%7Bbmatrix%7D%5C%5C
可以看出,梯度的Jacobi矩阵就是的Hessian矩阵。
矩阵变量函数

我们将多元函数梯度的定义加以推广,得到变量是矩阵的情形:
对于以的矩阵为自变量的函数,若存在矩阵满足
https://www.zhihu.com/equation?tex=%5Clim_%7BV+%5Crightarrow+0%7D%7B%5Cfrac%7Bf%28X%2BV%29-f%28X%29-%5Cleft+%5Clangle+G%2CV+%5Cright%5Crangle%7D%7B%5C%7CV%5C%7C%7D%7D%3D0%5C%5C
其中,是任意的矩阵范数,就称矩阵变量函数在Frechet可微,称为在Frechet可微意义下的梯度。
类似于向量情形,矩阵变量函数的梯度可以用其偏导数表示为
https://www.zhihu.com/equation?tex=%5Cnabla%7Bf%28x%29%7D%3D+%5Cbegin%7Bbmatrix%7D+%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B11%7D%7D+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B12%7D%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B1n%7D%7D%5C%5C+%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B21%7D%7D+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B22%7D%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7B2n%7D%7D%5C%5C+%5Cvdots+%26%5Cvdots%26%26%5Cvdots%5C%5C+%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7Bm1%7D%7D+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7Bm2%7D%7D+%26%5Ccdots+%26%5Cfrac%7B%5Cpartial+f%7D%7B%5Cpartial+x_%7Bmn%7D%7D%5C%5C+%5Cend%7Bbmatrix%7D%5C%5C
但在实际的应用中,矩阵Frechet可微的定义和使用往往比较繁琐,因此,我们将介绍另一种定义,该定义实际上是方向导数的某种推广:
定义(Gateaux可微)设为矩阵变量函数,如果存在矩阵,对任意方向 https://www.zhihu.com/equation?tex=V%5Csubset%5Cmathbb+%7BR%7D%5E%7Bm%5Ctimes%7Bn%7D%7D 满足
https://www.zhihu.com/equation?tex=%5Clim_%7Bt+%5Crightarrow+0%7D%7B%5Cfrac%7Bf%28X%2BtV%29-f%28X%29-t%5Cleft+%5Clangle+G%2CV+%5Cright%5Crangle%7D%7Bt%7D%7D%3D0%5C%5C
则称关于是Gateaux可微的,称为在处在Gateaux可微意义下的梯度。
(?)可以看出,若是Frechet可微的,则也是Gateaux可微的,且二者意义下的梯度相等;但这一命题反过来不一定成立。
自动微分

自动微分是使用计算机计算导数的算法。它有两种方式:前向模式和后向模式。
在前向模式中,我们依次计算每个中间变量的值和中间变量对其父变量的偏导数值,然后根据链式法则,可以求得每个中间变量对自变量的偏导数值,直到得到目标函数值和目标函数对自变量的偏导数值。可以看出,在该模式下,变量的求值和偏导的计算是同时进行的。
而后向模式则是先利用前向模式计算各个变量的值,再逆向计算目标函数对各个变量的偏导数值。
后向模式的梯度计算代价至多为函数值计算代价的5倍,但前向模式的计算代价可能多达函数值计算代价的n(n为自变量维数)倍,因此,后向模式在实际中更为流行。对于神经网络中的优化问题,其自动微分采用的是后向模式。
页: [1]
查看完整版本: 《最优化:建模、算法与理论》笔记|2.1范数,2.2导数