{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "304f6107f0704109",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# 🦊 贝叶斯优化\n",
    "\n",
    "相较于随机网格搜索、对半网格搜索，贝叶斯过程带有先验过程。\n",
    "\n",
    "### 1. 贝叶斯优化基本流程\n",
    "\n",
    "1. 定义需要估计的f(x)及x的定义域\n",
    "2. 取出有限的n个x，求解对应f(x)观测值\n",
    "3. 根据有限的观测值，对函数估计（先验知识），得出该估计f上的目标值\n",
    "4. 定义某种规则，以确定下一个需要计算的观测点\n",
    "\n",
    "持续在2-4步骤中循环，直至假设分布上的目标值达到标准，或者所有计算资源被用完为止\n",
    "\n",
    ">1. 当贝叶斯优化不用于HPO时，一般f(x)是完全的黑盒函数\n",
    ">2. 在HPO中，自变量x就是超参数空间\n",
    ">3. 根据有限观测值、对函数分布进行估计的工具被称为概率代理模型。自带某些假设，可以根据有限个点估计出函数分布。默认使用基于搞死混合模型的TPE过程\n",
    "\n",
    "### 2. 贝叶斯优化的实现\n",
    "\n",
    "可以实现的库很多，如：https://www.automl.org/hpo-overview/hpo-tools/hpo-packages/\n",
    "\n",
    "| HPO库     | 优劣                                                         | 推荐指数 |\n",
    "| --------- | ------------------------------------------------------------ | -------- |\n",
    "| bayes_opt | `pip install bayesian-optimization`<br />实现基于高斯过程的贝叶斯优化<br />当参数空间由大量连续性参数构成时推荐<br /><br />包含大量离散型参数时避免使用<br />算力/时间稀缺时避免使用 | 2        |\n",
    "| hyperopt  | `pip install hyperopt`<br />实现基于TPE的贝叶斯优化<br />支持各类提效工具<br />进度条清晰，展示美观，较少怪异警告或报错<br />可推广至深度学习领域<br /><br />不支持基于高斯过程的贝叶斯优化<br />代码限制多、较为复杂、灵活性差 | 4        |\n",
    "| optuna    | `pip install optuna`<br />可实现基于各类算法的贝叶斯优化<br />代码最简洁，具有一定的灵活性<br />可推广至深度学习领域<br /><br />非关键功能维护不佳，有怪异警告与报错 | 4        |\n",
    "| Skopt     | `pip install scikit-optimize`<br />作为Optuna辅助包          |          |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc30e60f732a3636",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## 贰丨Bayes_Opt实现\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9295634297fd65b3",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "\n",
    "主要流程：\n",
    "\n",
    "第一步，定义目标函数的模板\n",
    "\n",
    "```python\n",
    "def objective():\n",
    "    # 评估器\n",
    "    reg = RandomForestRegressor()\n",
    "    \n",
    "    # 交叉验证\n",
    "    cv = KFold()\n",
    "    result = cross_validate(reg,cv)\n",
    "    \n",
    "    # 交叉验证结果\n",
    "    loss = result['test_rmse']    \n",
    "    \n",
    "    return loss\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "\n",
    ">1. 目标函数的输入必须是具体的超参数，而不是整个超参数空间，更不能是数据、算法等超参数外的元素。在定义目标函数时，需要让超参数作为目标函数输入\n",
    ">\n",
    ">2. 超参数输入只能是浮点数，不支持整数与字符串。\n",
    ">\n",
    ">3. Bayes_Opt只支持寻找f(x)最大值，不支持寻找最小值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6745be6509445b29",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:02:38.338542498Z",
     "start_time": "2023-11-22T15:02:38.325096753Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from bayes_opt import BayesianOptimization\n",
    "from sklearn.model_selection import cross_validate,KFold\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "import numpy as np\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5a73edb5e6fa080d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:01:34.633736366Z",
     "start_time": "2023-11-22T15:01:34.610489785Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])\n"
     ]
    }
   ],
   "source": [
    "# 导入加利福尼亚房价数据\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data = fetch_california_housing()\n",
    "print(data.keys())\n",
    "\n",
    "# 划分数据集\n",
    "x = data['data']\n",
    "y = data['target']\n",
    "train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=22)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fd54396aecc48d70",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:02:16.442871752Z",
     "start_time": "2023-11-22T15:02:16.410038033Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# 第一步，定义目标函数\n",
    "def bayesopt_objective(n_estimators, max_depth, max_features, min_impurity_decrease):\n",
    "    # 定义评估器\n",
    "    # 需要调整的超参数等于目标函数的输入，不需要调整的超参数直接等于固定值\n",
    "    # 默认参数输入一定是浮点数，因此需要套上int函数处理成整数\n",
    "    reg = RandomForestRegressor(n_estimators=int(n_estimators),\n",
    "                                max_depth=int(max_depth),\n",
    "                                max_features=int(max_features),\n",
    "                                min_impurity_decrease=min_impurity_decrease,\n",
    "                                random_state=22,\n",
    "                                verbose=False,\n",
    "                                n_jobs=-1)\n",
    "\n",
    "    # 定义损失的输出，5折交叉验证下的结果，输出负根均方误差（-RMSE）\n",
    "    # 交叉验证需要使用数据，但不能让数据xy成为目标函数的输入\n",
    "    cv = KFold(n_splits=5, shuffle=True, random_state=22)\n",
    "    validation_loss = cross_validate(reg, train_x, train_y, scoring='neg_root_mean_squared_error', cv=cv,\n",
    "                                     error_score='raise')\n",
    "\n",
    "    # 交叉验证输出的评估指标是负根均方误差，因此本来就是负值\n",
    "    # 目标函数可以直接输出改损失的均值\n",
    "    return np.mean(validation_loss['test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64eb7e41b658df79",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "第二步，定义参数空间\n",
    "\n",
    "使用字典方式定义参数空间，取值范围为双向闭区间\n",
    "\n",
    "Bayes_Opt只支持填写参数的上下界，不支持填写步长等参数，且会讲所有参数都当做连续性超参数处理，因此Bayes_Opt会直接取出闭区间中任意浮点数作为备选参数（因此比其他贝叶斯优化库更大、更密，需要的迭代次数更多）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ad933fca478e0527",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:02:19.094012671Z",
     "start_time": "2023-11-22T15:02:19.057673958Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "param_grid_simple = {'n_estimators': (80, 100),\n",
    "                     'max_depth': (10, 25),\n",
    "                     'max_features': (10, 20),\n",
    "                     'min_impurity_decrease': (0, 1)}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9f846881475dbff",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "第三步，定义优化目标函数的具体流程\n",
    "\n",
    "在贝叶斯优化的实践中，都有涉及随机性的过程。在大部分优化库中，随机性无法控制。（即便是控制随机数种子，也无法固定优化过程，但选择出的超参数是可以复现的）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f697e17640c57efe",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:02:26.641839103Z",
     "start_time": "2023-11-22T15:02:26.593091378Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "def param_bayes_opt(init_points, n_iter):\n",
    "    # 定义优化器，先实例化优化器\n",
    "    opt = BayesianOptimization(bayesopt_objective,  # 目标函数\n",
    "                               param_grid_simple,  # 备选参数空间\n",
    "                               random_state=22  # 随机数种子(虽然无法控制过程)\n",
    "                               )\n",
    "    \n",
    "    # 使用优化器，Bayes_Opt只支持最大值\n",
    "    opt.maximize(init_points=init_points, # 抽取多少个初始观测值\n",
    "                 n_iter=n_iter # 一共观测/迭代多少次\n",
    "                 )\n",
    "    \n",
    "    # 优化完成，取出最佳参数和最佳分数\n",
    "    param_best = opt.max['params']\n",
    "    score_best = opt.max['target']\n",
    "    \n",
    "    # 打印最佳参数与最佳分数\n",
    "    print(f'Best Params: {param_best}\\n'\n",
    "          f'Best CVScore: {score_best}')\n",
    "    \n",
    "    return param_best,score_best"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e38941554b246d57",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "第四步，定义验证函数（非必须）\n",
    "\n",
    "在贝叶斯优化中，目标函数中交叉验证即数据分割都是可以规定好的，因此目标函数中设置了随机数种子，贝叶斯优化给出的最佳分数一定是与验证后分数相同"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "67489ed7ece64b3a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:02:30.801573718Z",
     "start_time": "2023-11-22T15:02:30.752701957Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "def bayes_opt_validation(params_best):\n",
    "    reg = RandomForestRegressor(n_estimators=int(params_best['n_estimators']))\n",
    "    \n",
    "    cv = KFold(n_splits=5,shuffle=True,random_state=22)\n",
    "    validation_loss = cross_validate(reg, train_x,train_y,scoring='neg_root_mean_squared_error', cv=cv,verbose=False,njobs=-1)\n",
    "    return np.mean(validation_loss['test_score'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c269382146a2943",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "第五步，执行优化流程"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb489b5c4e9b5af6",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-11-22T15:07:58.106587520Z",
     "start_time": "2023-11-22T15:02:48.358106878Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "params_best,score_best = param_bayes_opt(20,280)\n",
    "print(f'耗时: {(time.time()-start)/60} min')\n",
    "# validation_score = bayes_opt_validation(params_best)\n",
    "# print(f'validation_score: {validation_score}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
