How to speed up cross-validation

Hyperparameter tuning of Apache SparkML models can take a very long time, and the cost grows with the size of the parameter grid. You can speed things up by improving the performance of the cross-validation step in SparkML:

  • Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Remember to call an action on the DataFrame for the cache to take effect.
  • Increase the parallelism parameter inside the CrossValidator, which sets the number of threads to use when running parallel algorithms. The default setting is 1. See the CrossValidator documentation for more information.
  • Don't use the pipeline as the estimator inside the CrossValidator specification. In some cases where the featurizers are being tuned along with the model, running the whole pipeline inside the CrossValidator makes sense. However, this executes the entire pipeline for every parameter combination and fold. Therefore, if only the model is being tuned, set the model specification as the estimator inside the CrossValidator.
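To see why the choice of estimator matters, the following back-of-the-envelope sketch counts how many times a stage is run in each setup. It is an illustration only: the function name `stage_runs` is hypothetical, and it assumes every featurizer is re-run on each fit and that the CrossValidator refits the best parameter combination once on the full dataset after the search.

```python
def stage_runs(num_featurizers, grid_size, num_folds, pipeline_inside_cv):
    """Rough count of stage runs for one cross-validated tuning job.

    Simplifying assumptions (not measured Spark behaviour):
    - each featurizer is re-run every time the pipeline is fit;
    - the best parameter combination is refit once on the full data.
    """
    # One fit per (parameter combination, fold), plus the final refit.
    fits_driven_by_cv = grid_size * num_folds + 1

    if pipeline_inside_cv:
        # Featurizers and model are all repeated for every fit.
        return (num_featurizers + 1) * fits_driven_by_cv

    # Featurizers run once up front; only the model is repeated.
    return num_featurizers + fits_driven_by_cv


# Example: 3 featurizers, 12 parameter combinations, 5 folds.
whole_pipeline = stage_runs(3, 12, 5, pipeline_inside_cv=True)
model_only = stage_runs(3, 12, 5, pipeline_inside_cv=False)
print(whole_pipeline, model_only)
```

Under these assumptions the whole-pipeline setup runs 244 stages while the model-only setup runs 64, which is where most of the saving comes from when the featurizers themselves are not being tuned.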

Note

CrossValidator can be set as the final stage inside the pipeline, after the featurizers. The best model identified by the CrossValidator is then the output of that stage.
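A minimal PySpark sketch of this arrangement might look like the following. The featurizers (`VectorAssembler`, `StandardScaler`), the model (`LogisticRegression`), the grid values, the column names, and the helper names `build_tuning_pipeline` and `fit_on_cached` are all illustrative assumptions rather than part of the original guidance. Only the model is passed as the estimator, `parallelism` is raised above its default of 1, and the input DataFrame is cached and materialized with an action before fitting.

```python
def build_tuning_pipeline(feature_cols, label_col="label",
                          parallelism=4, num_folds=3):
    """Build a Pipeline whose final stage is a CrossValidator over the model only.

    Requires an active SparkSession when called.
    """
    # Imported here so the sketch can be loaded without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Featurizers: not tuned, so they stay outside the CrossValidator.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")

    # Model and parameter grid (illustrative values).
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    # The model alone is the estimator; parallelism defaults to 1.
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol=label_col),
        numFolds=num_folds,
        parallelism=parallelism,
    )

    # CrossValidator is the last stage, after the featurizers.
    return Pipeline(stages=[assembler, scaler, cv])


def fit_on_cached(train_df, pipeline):
    """Cache the input, force the cache with an action, then fit."""
    train_df.cache()
    train_df.count()  # action: materializes the cache

    pipeline_model = pipeline.fit(train_df)
    # The last fitted stage is a CrossValidatorModel holding the best model.
    best_model = pipeline_model.stages[-1].bestModel
    return pipeline_model, best_model
```

With a DataFrame `train_df` containing the chosen feature columns and a binary `label` column, `fit_on_cached(train_df, build_tuning_pipeline(["f1", "f2"]))` would return the fitted pipeline together with the best model; the column names here are placeholders.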