How to speed up cross-validation

Hyperparameter tuning of Apache SparkML models can take a very long time, depending on the size of the parameter grid. You can shorten it by improving the performance of the cross-validation step in SparkML:

  • Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Remember to call an action on the DataFrame so that the cache takes effect.
  • Increase the parallelism parameter of the CrossValidator, which sets the number of threads to use when running parallel algorithms. The default setting is 1. See the CrossValidator documentation for more information.
  • Don't use the pipeline as the estimator inside the CrossValidator specification unless you need to. When the featurizers are being tuned along with the model, running the whole pipeline inside the CrossValidator makes sense. However, this executes the entire pipeline for every parameter combination and fold. Therefore, if only the model is being tuned, set the model specification as the estimator inside the CrossValidator.
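
The three tips above can be combined roughly as in the following PySpark sketch. It is illustrative only: the function name tune_model_only, the column names x1, x2, and label, the choice of LogisticRegression, and the grid values are all assumptions that do not come from this article. Caching the featurized DataFrame as well is an optional extra step, also an assumption. The function expects an active SparkSession and a DataFrame that has those columns.

```python
def tune_model_only(train_df, num_folds=3, parallelism=4):
    """Cache the data, featurize outside the CrossValidator, tune only the model.

    Hypothetical helper: column names, model, and grid values are placeholders.
    """
    # pyspark imports are kept local so this helper can be defined
    # even where Spark is not installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Tip 1: cache the data, then call an action so the cache is materialized.
    train_df = train_df.cache()
    train_df.count()

    # Tip 3: apply the featurizer once, outside the CrossValidator.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    features_df = assembler.transform(train_df).cache()  # optional extra cache (assumption)
    features_df.count()

    # Only the model is the estimator that gets tuned.
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build()
    )

    # Tip 2: raise parallelism above its default of 1.
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol="label"),
        numFolds=num_folds,
        parallelism=parallelism,
    )
    return cv.fit(features_df)
```

Calling tune_model_only(train_df) returns a CrossValidatorModel whose bestModel attribute holds the best fitted model. A suitable parallelism value depends on the cluster, so treat the default of 4 used here as a starting point rather than a recommendation.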

Note

CrossValidator can be set as the final stage inside the pipeline, after the featurizers. The best model identified by the CrossValidator is output.
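
A minimal sketch of that arrangement in PySpark might look like the following. The function name build_pipeline_with_cv, the column names x1, x2, and label, the LogisticRegression model, and the grid values are hypothetical placeholders, not part of this article. The function must be called with an active SparkSession.

```python
def build_pipeline_with_cv(num_folds=3, parallelism=4):
    """Build a pipeline whose last stage is a CrossValidator around the model.

    Hypothetical helper: column names, model, and grid values are placeholders.
    """
    # pyspark imports are kept local so this helper can be defined
    # even where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Featurizer stage: fitted and applied once by the outer pipeline.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")

    # The model alone is the estimator inside the CrossValidator.
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol="label"),
        numFolds=num_folds,
        parallelism=parallelism,
    )

    # The CrossValidator is the final stage, after the featurizers.
    return Pipeline(stages=[assembler, cv])
```

After model = build_pipeline_with_cv().fit(train_df), the last entry of model.stages is the fitted CrossValidatorModel, and its bestModel attribute is the best model that was found.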