super1 posted on 2021-12-18 13:19

Mathematical Foundations of Quantitative Research: Optimization Methods

Introduction

Optimization methods underpin machine learning and deep learning as a whole, and portfolio construction likewise relies on optimization to compute the weight of each instrument. This article introduces the mathematical foundations of the most commonly used optimization methods, so that optimizers and hyperparameter settings can be chosen more soundly during model development.
First-Order Optimization Algorithms

At present, neural networks are trained almost exclusively with first-order optimization algorithms, including basic gradient descent and its many derivatives. The basic logic of gradient descent is:
Set the minimum gradient threshold $eps$, the maximum number of iterations $N$, and the learning rate $\alpha$.
Initialize $x_0$, $k = 0$.
While $\|\nabla f(x_k)\| > eps$ and $k < N$:

$x_{k+1} = x_k - \alpha \nabla f(x_k)$

$k = k + 1$
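A minimal NumPy sketch of this loop is shown below; the quadratic objective in the usage example and the default values of `alpha`, `eps`, and `N` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-6, N=1000):
    """Plain gradient descent: stop when the gradient norm falls below eps
    or after N iterations."""
    x, k = x0.astype(float), 0
    while np.linalg.norm(grad_f(x)) > eps and k < N:
        x = x - alpha * grad_f(x)   # x_{k+1} = x_k - alpha * grad f(x_k)
        k += 1
    return x, k

# Example: minimize f(x) = ||x - 1||^2, whose gradient is 2*(x - 1).
x_star, iters = gradient_descent(lambda x: 2 * (x - np.ones(2)), np.zeros(2))
```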
At each iteration, gradient descent computes the gradient of the function at the current point and adjusts the parameters by that gradient scaled by the learning rate, thereby driving the function value down. Mathematically, this can be understood through a Taylor expansion.
For the $k$-th iteration, the function value before the update is $f(x_k)$ and after the update it is $f(x_k - \alpha \nabla f(x_k))$. Expanding the latter to first order:

$f(x_k - \alpha \nabla f(x_k)) = f(x_k) + (\nabla f(x_k))^T (-\alpha \nabla f(x_k)) + o(\|\alpha \nabla f(x_k)\|)$
$= f(x_k) - \alpha (\nabla f(x_k))^T \nabla f(x_k) + o(\|\alpha \nabla f(x_k)\|)$
Ignoring the infinitesimal term, we have:

$f(x_k - \alpha \nabla f(x_k)) - f(x_k) = -\alpha (\nabla f(x_k))^T \nabla f(x_k)$
Since $(\nabla f(x_k))^T \nabla f(x_k) \geq 0$ always holds, as long as the learning rate $\alpha > 0$ (and small enough for the first-order approximation to remain valid), every iteration satisfies $f(x_{k+1}) < f(x_k)$, so the loss function is driven toward its minimum.
In practice, plain gradient descent can suffer from slow convergence, so researchers have proposed improved variants. Common ones include Momentum, AdaGrad, RMSProp, AdaDelta, and Adam.
Momentum incorporates the value of the previous update when adjusting the parameters. In formulas:

$x_{k+1} = x_k + v_k$

$v_k = -\alpha \nabla f(x_k) + v_{k-1}$
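A sketch of a single Momentum update, assuming the common practical variant in which the previous velocity $v_{k-1}$ is additionally scaled by a momentum coefficient `gamma < 1` (the formula above corresponds to `gamma = 1`); the default values are illustrative.

```python
import numpy as np

def momentum_step(x, v, grad, alpha=0.1, gamma=0.9):
    """One Momentum update; gamma is the (assumed) coefficient that decays
    the previous velocity v_{k-1}."""
    v_new = gamma * v - alpha * grad   # v_k = gamma * v_{k-1} - alpha * grad f(x_k)
    return x + v_new, v_new            # x_{k+1} = x_k + v_k
```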
AdaGrad divides the gradient by the accumulated magnitude of past gradients along each direction, so that parameters which have already received many large updates get a smaller effective learning rate. In formulas:

$x_{k+1}^i = x_k^i - \alpha \frac{g_k^i}{\sqrt{\sum_{j=1}^{k}(g_j^i)^2 + \varepsilon}}$
where $i$ indexes the components of the vector and $g_k$ is the gradient vector at the $k$-th iteration.
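A sketch of one AdaGrad step; the state variable `sum_sq` holds the per-component sum of squared gradients, and the default `alpha` and `eps` are illustrative.

```python
import numpy as np

def adagrad_step(x, grad, sum_sq, alpha=0.01, eps=1e-8):
    """One AdaGrad update; sum_sq accumulates the squared gradients
    component-wise over all previous iterations."""
    sum_sq = sum_sq + grad ** 2                       # running sum of (g_j^i)^2
    x_new = x - alpha * grad / np.sqrt(sum_sq + eps)  # per-component scaled step
    return x_new, sum_sq
```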
RMSProp's improvement over AdaGrad is concentrated in the denominator: the squared historical gradients are decayed before being accumulated.
Given an initial $E = 0$:

$E[g^2]_k = \delta E[g^2]_{k-1} + (1 - \delta) g_k^2$

$x_{k+1}^i = x_k^i - \alpha \frac{g_k^i}{\sqrt{E[g^2]_k^i + \varepsilon}}$
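A sketch of one RMSProp step in the same notation; `E_g2` plays the role of $E[g^2]_k$, and the default `alpha`, `delta`, and `eps` are illustrative.

```python
import numpy as np

def rmsprop_step(x, grad, E_g2, alpha=0.001, delta=0.9, eps=1e-8):
    """One RMSProp update; E_g2 is the exponentially decayed average of
    squared gradients, E[g^2]_k."""
    E_g2 = delta * E_g2 + (1 - delta) * grad ** 2
    x_new = x - alpha * grad / np.sqrt(E_g2 + eps)
    return x_new, E_g2
```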
AdaDelta follows computation logic similar to RMSProp's: $E$ is computed by the same formula, but an RMS quantity is added:

$RMS[g]_k = \sqrt{E[g^2]_k + \varepsilon}$
Analogously, $RMS[\Delta x]_{k-1}$ can be computed from the decayed average of the squared parameter updates.
The final update formula is:

$x_{k+1} = x_k - \frac{RMS[\Delta x]_{k-1}}{RMS[g]_k} g_k$
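A sketch of one AdaDelta step; `E_dx2` is the decayed average of squared updates used to form $RMS[\Delta x]_{k-1}$, and the default `delta` and `eps` are illustrative.

```python
import numpy as np

def adadelta_step(x, grad, E_g2, E_dx2, delta=0.95, eps=1e-6):
    """One AdaDelta update; E_g2 and E_dx2 are decayed averages of the
    squared gradients and squared parameter updates, respectively."""
    E_g2 = delta * E_g2 + (1 - delta) * grad ** 2
    dx = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * grad  # -RMS[dx]_{k-1}/RMS[g]_k * g_k
    E_dx2 = delta * E_dx2 + (1 - delta) * dx ** 2
    return x + dx, E_g2, E_dx2
```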
Adam introduces two vectors $m$ and $v$, both initialized to 0:

$m_k^i = \beta_1 m_{k-1}^i + (1 - \beta_1) g_k^i$

$v_k^i = \beta_2 v_{k-1}^i + (1 - \beta_2) (g_k^i)^2$
The final update formula is:

$x_{k+1}^i = x_k^i - \alpha \frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k} \cdot \frac{m_k^i}{\sqrt{v_k^i} + \varepsilon}$
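A sketch of one Adam step with the bias correction folded into the step size, assuming the commonly used defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$.

```python
import numpy as np

def adam_step(x, grad, m, v, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration k (k starts at 1); m and v are the
    decayed first and second moments of the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-corrected step size, equivalent to using m/(1-beta1^k) and v/(1-beta2^k)
    step = alpha * np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)
    return x - step * m / (np.sqrt(v) + eps), m, v
```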
Second-Order Optimization Algorithms

Newton's Method

The basic idea of Newton's method is that for a function $f(x)$, a necessary condition for an extremum at $x$ is $\nabla f(x) = 0$. In practice, solving this equation directly is cumbersome, so we instead Taylor-expand the function; in Newton's method the expansion is carried to second order:

$f(x) = f(x_0) + \nabla f(x_0)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x_0) \Delta x + o(\|\Delta x\|^2)$
Taking the gradient of both sides with respect to $x$ (and dropping the higher-order term), we get:

$\nabla f(x) = \nabla f(x_0) + \nabla^2 f(x_0)(x - x_0)$
Substituting into the original condition $\nabla f(x) = 0$ and solving for $x$:

$x = x_0 - (\nabla^2 f(x_0))^{-1} \nabla f(x_0)$
Here, the Hessian (second derivative) of $f(x)$ is denoted $H$ and the gradient is denoted $g$.
Because the function was only Taylor-expanded, a single step gives just an approximation; to obtain a more accurate result we keep iterating, which yields the update formula:

$x_{k+1} = x_k - \alpha H_k^{-1} g_k$
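A sketch of this (damped) Newton iteration; solving the linear system instead of forming $H_k^{-1}$ explicitly is an implementation choice beyond the text, and the default `alpha`, `eps`, and `N` are illustrative.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, alpha=1.0, eps=1e-8, N=100):
    """Damped Newton iteration: x_{k+1} = x_k - alpha * H_k^{-1} g_k.
    Solves H_k d = g_k rather than inverting H_k explicitly."""
    x, k = x0.astype(float), 0
    while np.linalg.norm(grad_f(x)) > eps and k < N:
        x = x - alpha * np.linalg.solve(hess_f(x), grad_f(x))
        k += 1
    return x, k
```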
Quasi-Newton Methods

Newton's method has a fatal flaw: when the Hessian matrix is not invertible, the method fails (and even when it is invertible, computing and inverting it is expensive in high dimensions). Quasi-Newton methods were proposed to address this; typical examples include the DFP, BFGS, and L-BFGS algorithms. Since their basic logic is similar, we describe only DFP.
Returning to $\nabla f(x) = \nabla f(x_{k+1}) + \nabla^2 f(x_{k+1})(x - x_{k+1})$ and substituting $x = x_k$, we obtain:

$g_{k+1} - g_k = H_{k+1}(x_{k+1} - x_k)$, abbreviated as $s_k = H_{k+1}^{-1} y_k$, where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$.
Therefore, once $s$ and $y$ are known, the inverse of the Hessian can be approximated and updated at every iteration. Suppose the object of the iteration is the inverse matrix; for notational convenience we temporarily drop the superscript $(-1)$ and let:

$H_{k+1} = H_k + E_k$
where $E$ is a correction matrix, so the problem reduces to solving for $E$. The form of $E$ is prescribed as:

$E_k = \alpha_k u_k u_k^T + \beta_k v_k v_k^T$
Substituting this into the relation above, which in the new notation reads $H_{k+1} y_k = s_k$, and rearranging gives:

$\alpha_k u_k u_k^T y_k + \beta_k v_k v_k^T y_k = s_k - H_k y_k$
This equation does not have a unique solution; for ease of computation we impose a particular set of conditions:

$\alpha_k u_k u_k^T y_k = s_k$, $\quad \beta_k v_k v_k^T y_k = -H_k y_k$, with $u_k = s_k$, $v_k = H_k y_k$
Substituting into the equation and solving gives:

$\alpha_k = \frac{1}{s_k^T y_k}, \quad \beta_k = -\frac{1}{y_k^T H_k y_k}$
Plugging these results into the update formula for $H$ yields the final expression:

$H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}$
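A sketch of this DFP update of the inverse-Hessian approximation, directly transcribing the formula above (it relies on $H$ being kept symmetric, as the approximation is throughout this derivation).

```python
import numpy as np

def dfp_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H, given the step
    s_k = x_{k+1} - x_k and the gradient change y_k = g_{k+1} - g_k."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```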