How to speed up cross-validation
Hyperparameter tuning of Apache SparkML models can take a long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up:
- Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Remember to call an action on the `DataFrame` for the cache to take effect.
- Increase the `parallelism` parameter of the `CrossValidator`, which sets the number of threads to use when running parallel algorithms. The default setting is 1. See the CrossValidator documentation for more information.
- Avoid using the pipeline as the estimator inside the `CrossValidator` specification. In some cases, where the featurizers are being tuned along with the model, running the whole pipeline inside the `CrossValidator` makes sense. However, this executes the entire pipeline for every parameter combination and fold. Therefore, if only the model is being tuned, set the model specification as the estimator inside the `CrossValidator`.
Note
`CrossValidator` can be set as the final stage inside the pipeline after the featurizers. The best model identified by the `CrossValidator` is output.