超参数优化(二): 贝叶斯优化

franciscochonge 发表于 2021-11-19 21:14

引言：近年来贝叶斯优化在求解黑盒函数问题中应用越来越广泛，已经成为超参数优化的主流方法。贝叶斯优化是一种全局优化的方法，目标函数只需要满足一致连续或者利普希茨连续等局部平滑性假设；引入采集函数，进行有效探索和利用，能在较少的评估次数下取得复杂目标函数的近似解。
<hr/>2.1 背景
2.2 贝叶斯优化建模
2.2.1 问题定义
2.2.2 贝叶斯优化框架
2.3 代理模型
2.4 采集函数
2.5 贝叶斯优化进一步探讨二、贝叶斯优化

2.1 背景

自动超参数优化是黑盒优化的一个具体应用。黑盒优化具有如下的特点：
优化目标：没有显式的函数表达式，获取不到其梯度信息；其值域分布是多峰、非凸非凹的复杂昂贵函数；函数空间：由连续、离散整数值构成的，且实际应用中维度较高；优化目标求解：求解非常昂贵耗时，在工业设计中通常一次评估花费数十万至百万元以上；在大规模的深度学习训练中通常花费若干天甚至数周时间进行一次参数搜索。
黑盒优化的求解方法：
元启发式算法(meta-heuristic algorithm)：遗传算法 (genetic algorithm, GA)、差分进化(differential evolution, DE) 算法、粒子群算法 (particle swarm optimization, PSO)、进化策略(evolutionary strategies, ES)无导数优化方法：随机搜索、直接搜索法；基于模型的如多项式方法，径向基函数插值方法和基于代理模型方法，如贝叶斯优化。
近年来，贝叶斯优化在求解黑盒函数问题中应用越来越广泛，已经成为超参数优化的主流方法。贝叶斯优化的优势：
是一种全局优化的方法，目标函数只需要满足一致连续或者利普希茨连续(Lipschitz) 等局部平滑性假设；能在较少的评估次数下，取得复杂目标函数的近似解；引入采集函数，进行有效的探索和利用(Exploration and Exploitation)
我们将在本节末尾讨论贝叶斯优化的局限性。
2.2 贝叶斯优化建模

2.2.1 问题定义

超参数优化问题定义：

https://www.zhihu.com/equation?tex=%5Cbegin%7Baligned%7D+argmin_%7Bx%5Cin+X%7D+%5Censpace+f%28x%29++++%5Cend%7Baligned%7D+%5Ctag%7B3%7D+
其中，为超参数的一组设置取值，https://www.zhihu.com/equation?tex=X为混合设计空间(Design Space)。为超参数优化中，我们需要优化的目标（如最小化Loss)。
超参数优化算法的目标是以最快的方式找到全局最优解https://www.zhihu.com/equation?tex=x%5E%2A+%3D+argmax_%7Bx%5Cin+X%7D%5Censpace+f%28x%29+ 。
为了达到这个目标，贝叶斯算法求解超参数优化问题，通常分为两步：
第1步：学习一个代理模型(Surrogate Model)；第2步：通过采集函数(Acquisition Function)，来决定输出下一个采集点。
2.2.2 贝叶斯优化框架

由贝叶斯定理：

https://www.zhihu.com/equation?tex=%5Cbegin%7Baligned%7D+%E5%90%8E%E9%AA%8C%E5%88%86%E5%B8%83+%26%3D+%E5%85%88%E9%AA%8C%E5%88%86%E5%B8%83+%2B+%E8%A7%82%E5%AF%9F%E6%95%B0%E6%8D%AE%5C%5C+p%28A%7CB%29+%26%3D+%5Cfrac%7Bp%28B%7CA%29+%5Ccdot+p%28A%29%7D%7B+p%28B%29%7D+%5Cend%7Baligned%7D+%5Ctag%7B4%7D+
https://www.zhihu.com/equation?tex=p%28A%29 为先验分布，即我们的代理模型；https://www.zhihu.com/equation?tex=p%28B%7CA%29 为给定代理模型，观察数据的分布；https://www.zhihu.com/equation?tex=p%28A%7CB%29 为后验分布，即给定观察数据之后代理模型新分布。
贝叶斯优化框架：

图1：贝叶斯优化框架

步骤1：随机初始化https://www.zhihu.com/equation?tex=n_0点https://www.zhihu.com/equation?tex=X_%7Binit%7D+%3D+%5C%7Bx_0%2Cx_1%2C...%2Cx_%7Bn_0-1%7D%5C%7D步骤2：获得其对应的函数值https://www.zhihu.com/equation?tex=f%28X_%7Binit%7D%29，初始点集https://www.zhihu.com/equation?tex=D_0%3D%5C%7BX_%7Binit%7D+%2Cf%28X_%7Binit%7D%29%5C%7D令https://www.zhihu.com/equation?tex=t%3Dn_0, https://www.zhihu.com/equation?tex=D_%7Bt-1%7D+%3D+D_0，循环直到达到最大次数https://www.zhihu.com/equation?tex=N
步骤3：根据当前获得的点集https://www.zhihu.com/equation?tex=D_%7Bt-1%7D，构建代理模型步骤4：基于代理模型，最大化采集函数https://www.zhihu.com/equation?tex=%5Calpha%28x%7CD_%7Bt-1%7D%29，获得下一个评估点:

https://www.zhihu.com/equation?tex=x_t%3D+argmin+%5Censpace+%5Calpha%28x%7CD_%7Bt-1%7D%29+%5C%5C

步骤5：获得评估点https://www.zhihu.com/equation?tex=x_t的函数值https://www.zhihu.com/equation?tex=f%28x_t%29, 将其加入到当前评估点集合中:

https://www.zhihu.com/equation?tex=D_t%3DD_%7Bt-1%7D+U+%5C%7Bx_t%2Cf%28x_t%29+%5C%7D+%5C%5C
步骤6：输出最优候选评估点：https://www.zhihu.com/equation?tex=%5C%7Bx%5E%2A%2Cy%5E%2A%5C%7D
2.3 代理模型

当黑盒函数的自变量是连续值时，高斯过程回归模型(GPs)是一个非常高效的代理模型。对于黑盒函数，我们通常假设其符合某一个GPs先验分布。高斯过程回归模型(GPs)有如下的两方面来决定：
均值函数https://www.zhihu.com/equation?tex=m%28x%29 ；协方差函数(矩阵)，或者核函数https://www.zhihu.com/equation?tex=k_%7B%5Ctheta%7D%28x%2Cx%5E%7B%27%7D%29 ，其中https://www.zhihu.com/equation?tex=%5Ctheta为核函数超参数。
基于上面两点，我们通常假设我们观测到的函数值https://www.zhihu.com/equation?tex=y_%7Bl%7D通过如下的形式产生：

https://www.zhihu.com/equation?tex=y_%7Bl%7D+%3D+f%28x_l%29+%2B+%5Cepsilon_%7Bl%7D%EF%BC%8C+where+%5Censpace+%5Cepsilon_%7Bl%7D+%5Csim++N%280%2C%5Csigma_%7Bnoise%7D%5E2+%29+%5C%5C
进而，可以得到高斯似然函数形式：

https://www.zhihu.com/equation?tex=y_%7Bl%7D%7Cx_%7Bl%7D+%5Csim+++N%28f_l+%2C%5Csigma_%7Bnoise%7D%5E2+%29+%5C%5C
其中https://www.zhihu.com/equation?tex=f_l+%3D+f%28x_l%29,服从如下的分布：

https://www.zhihu.com/equation?tex=f%28x%29+%5Csim+GP%28m%28x%29%2Ck_%7B%5Ctheta%7D%28x%2Cx%5E%7B%27%7D%29%29+%5Ctag%7B5%7D+
通常假设GP kernel 是平稳的，只依赖于两个点和https://www.zhihu.com/equation?tex=x%5E%7B%27%7D之间的模 https://www.zhihu.com/equation?tex=%7C%7Cx-x%5E%7B%27%7D%7C%7C，进一步我们假设在带噪音分布条件数据点是平稳的且同质的。如果我们观测的数据点https://www.zhihu.com/equation?tex=%5C%7Bx%2Cy%5C%7D不满足这个条件，我们将不能很好的近似黑盒函数。
2.4 采集函数

给定先验和观察数据，由贝叶斯定理我们可以得到相应的后验分布：

https://www.zhihu.com/equation?tex=%E5%90%8E%E9%AA%8C%E5%88%86%E5%B8%83%3D+%E5%85%88%E9%AA%8C%E5%88%86%E5%B8%83+%2B+%E8%A7%82%E5%AF%9F%E6%95%B0%E6%8D%AE+%5C%5C
采集函数是根据后验分布https://www.zhihu.com/equation?tex=p_%7B%5Ctheta%7D%28f%28%5Ccdot%29%7CD%29来构造的。在选取GPs作为代理模型条件下，其后验分布仍然是高斯分布：

https://www.zhihu.com/equation?tex=p%28f%28x_%7B1%3Aq%7D%29%7C+D%29+%3D+N%28%5Cmu_%7B%5Ctheta%7D%28x_%7B1%3Aq%7D%29%2C+%5Csum_%7B%5Ctheta%7D%28x_%7B1%3Aq%7D%29%29+%5Ctag%7B6%7D+
接下来我们列出贝叶斯优化中常见的3种采集函数：
提升的概率https://www.zhihu.com/equation?tex=%28Probability+improvement+%2CPI%29:

https://www.zhihu.com/equation?tex=%5Calpha_%7BPI%7D%5E%7B%5Ctheta%7D+%28x_%7B1%3Aq%7D+%7C+D+%29+%3D+E_%7Bpost.%7D%5B%5Cmax_%7Bj%5Cin+1%3Aq%7D%5C%7BH%5C%7Bf%28x_j%29+-+f%28x%5E%2B%29+%5C%7D%5C%7D%5D+%5Ctag%7B7%7D+
其中https://www.zhihu.com/equation?tex=H%5C%7B.%5C%7D为左连续的阶跃函数或开关函数。采集函数PI的原理是在当前https://www.zhihu.com/equation?tex=x%5E%2B的邻域附近，找到比大的候选点，取这些概率最大的点，而不管比大多少。
提升的期望https://www.zhihu.com/equation?tex=%28Expected+Improvement+%2CEI%29:

https://www.zhihu.com/equation?tex=%5Calpha_%7BEI%7D%5E%7B%5Ctheta%7D+%28x_%7B1%3Aq%7D%7CD%29+%3D+E_%7Bpost.%7D%5B%5Cmax_%7Bj%5Cin+1%3Aq%7D%5C%7BReLU%28f%28x_j%29+-+f%28x%5E%2B%29%29%5C%7D%5D+%5Ctag%7B8%7D+
相比https://www.zhihu.com/equation?tex=PI，给出了下一候选点比当前最优值大多少的问题（提升期望)
上置信边界https://www.zhihu.com/equation?tex=%28Upper+confidence+bound+%EF%BC%8CUCB%29:

https://www.zhihu.com/equation?tex=%5Calpha_%7BUBC%7D%5E%7B%5Ctheta%7D+%28x_%7Bj%7D%29+%3D+E_%7Bpost.%7D%5B%5Cmax_%7Bj%5Cin+1%3Aq%7D%5C%7B+%5Cmu_%7B%5Ctheta%7D+%28x_j%29+%2B+%5Csqrt%7B%5Cbeta+%5Cpi+%2F2+%7D+%7C+%5Cgamma_%7B%5Ctheta%7D+%28x_j%29+%7C+%5C%7D%5D+%5Ctag%7B9%7D+
其中https://www.zhihu.com/equation?tex=%5Cmu_%7B%5Ctheta%7D+%28x_j%29是后续点的后验均值，https://www.zhihu.com/equation?tex=%5Cgamma_%7B%5Ctheta%7D+%28x_j%29+%3D+f%28x_j%29+-+%5Cmu_%7B%5Ctheta%7D+%28x_j%29
不同的采集函数在处理具体问题时有不同的优缺点：

图2：采集函数优缺点对比

和https://www.zhihu.com/equation?tex=UCB都倾向于选取高均值高方差的候选点，但候选点并不趋一致：

图3：EI和UCB不一致情况

2.5 贝叶斯优化进一步探讨

贝叶斯优化中存在的难点：
代理模型：我们通常选取GPs；当我们求解的问题维度变高时，GPs求解的复杂度变高；如何求解高维的贝叶斯优化问题，是一个值得探讨和研究的问题；采集函数：通常选取上述函数中的一个作为代理模型的采集函数（中途不会更改采集函数）。这个采集函数相对于我们的代理模型和我们要优化的问题是否是最优的呢？我们在HEBO这一节详细讨论这个问题；和具体优化问题相关：在应用贝叶斯优化求解实际问题，通常要根据实际问题，结合相关的领域知识选择合适的代理模型和采集函数，如我们后续讨论的没有”一招鲜吃遍天“的方法。
<hr/>欢迎关注:simplex101,了解更多超参数优化（黑盒优化）分享内容。

参考文献

QQ浏览器2021AI算法大赛，https://algo.browser.qq.com/ Turner, Ryan, et al. &#34;Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020.&#34; arXiv preprint arXiv:2104.10201 (2021). TCowen-Rivers, Alexander I., et al. &#34;Hebo: Heteroscedastic evolutionary bayesian optimisation.&#34; arXiv e-prints (2020): arXiv-2012. Liu, Jiwei, Bojan Tunguz, and Gilberto Titericz. &#34;GPU Accelerated Exhaustive Search for Optimal Ensemble of Black-Box Optimization Algorithms.&#34; arXiv preprint arXiv:2012.04201 (2020). Sazanovich, Mikita, et al. &#34;Solving black-box optimization challenge via learning search space partition for local bayesian optimization.&#34; NeurIPS 2020 Competition and Demonstration Track. PMLR, 2021. Wang, Linnan, Rodrigo Fonseca, and Yuandong Tian. &#34;Learning search space partition for black-box optimization using monte carlo tree search.&#34; arXiv preprint arXiv:2007.00708 (2020). Eriksson, David, et al. &#34;Scalable global optimization via local bayesian optimization.&#34; Advances in Neural Information Processing Systems 32 (2019): 5496-5507. Eriksson, David, David Bindel, and Christine A. Shoemaker. &#34;pySOT and POAP: An event-driven asynchronous framework for surrogate optimization.&#34; arXiv preprint arXiv:1908.00420 (2019). Regis, Rommel G., and Christine A. Shoemaker. &#34;A stochastic radial basis function method for the global optimization of expensive functions.&#34; INFORMS Journal on Computing 19.4 (2007): 497-509. 基于径向基函数(RBF)的函数插值多目标优化总结：概念、算法和应用。知乎多目标优化专栏，https://www.zhihu.com/column/c_1360363335737843712 刘浩洋, 户将, 李勇锋，文再文，最优化：建模、算法与理论, 高教出版社，2020版 Garud, Sushant S., Iftekhar A. Karimi, and Markus Kraft. &#34;Design of computer experiments: A review.&#34; Computers & Chemical Engineering 106 (2017): 71-95. Viana, Felipe AC. &#34;A tutorial on Latin hypercube design of experiments.&#34; Quality and reliability engineering international 32.5 (2016): 1975-1985. 崔佳旭, 杨博. 贝叶斯优化方法和应用综述. 软件学报, 2018, 29(10): 3068-3090. Vu Nguyen.&#34;Tutorial on Recent Advances in Bayesian Optimization&#34; Asian Conference on Machine Learning (ACML), 2020. 江璞玉, 刘均, 周奇, 等. 大规模黑箱优化问题元启发式求解方法研究进展. 中国舰船研究, 2021, 16(4): 1–18 doi:10.19693/j.issn.1673-3185.02248 Larson, Jeffrey, Matt Menickelly, and Stefan M. Wild. &#34;Derivative-free optimization methods.&#34; Acta Numerica 28 (2019): 287-404.

页: [1]

Unity开发者联盟's Archiver

超参数优化(二): 贝叶斯优化