
Optimize Apache Spark jobs in HDInsight

This article provides an overview of strategies to optimize Apache Spark jobs on Azure HDInsight.

Overview

The performance of your Apache Spark jobs depends on multiple factors, including how your data is stored, how the cluster is configured, and the operations used when processing the data.

Common challenges you might face include memory constraints due to improperly sized executors, long-running operations, and tasks that result in Cartesian operations.
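To illustrate the executor-sizing and Cartesian points above, the following PySpark sketch shows how executor settings might be supplied when building a session and how a join without a key condition produces a Cartesian product. The configuration values, DataFrames, and join key are hypothetical, not recommended settings for any particular cluster.

```python
# A minimal PySpark sketch, assuming a hypothetical HDInsight cluster.
# The executor values are illustrative only, not recommended settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.instances", "6")   # number of executors (hypothetical)
    .config("spark.executor.cores", "3")       # cores per executor (hypothetical)
    .config("spark.executor.memory", "8g")     # heap per executor (hypothetical)
    .getOrCreate()
)

orders = spark.range(1_000_000).withColumnRenamed("id", "order_id")
customers = spark.range(10_000).withColumnRenamed("id", "customer_id")

# A join with no join condition is a Cartesian product: every order row is
# paired with every customer row, which quickly becomes a long-running task.
cartesian = orders.crossJoin(customers)

# A keyed join avoids the Cartesian explosion (the key below is illustrative).
keyed = orders.join(
    customers,
    orders.order_id % 10_000 == customers.customer_id,
)
```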

There are also many optimizations that can help you overcome these challenges, such as caching and allowing for data skew.
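To make the caching and skew points concrete, the following PySpark sketch caches a reused DataFrame and applies a simple "salting" pattern, one common way to spread a skewed join key across partitions. The paths, column names (`key`, `event_type`, `event_date`), and the number of salt buckets are assumptions for illustration only.

```python
# A minimal PySpark sketch of two optimizations mentioned above: caching a
# reused DataFrame, and "salting" a skewed join key. The paths, column names,
# and bucket count below are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("caching-and-skew-example").getOrCreate()

events = spark.read.parquet("wasbs:///example/data/events")  # hypothetical path
lookup = spark.read.parquet("wasbs:///example/data/lookup")  # hypothetical path

# Caching: keep a DataFrame in memory when several later actions reuse it,
# so Spark does not recompute it from the source each time.
clicks = events.filter(F.col("event_type") == "click").cache()
clicks.count()                                   # materializes the cache
daily = clicks.groupBy("event_date").count()     # reuses the cached data

# Salting: when most rows share one value of the join key, appending a random
# suffix spreads that key across partitions; the smaller side is expanded so
# every suffix still finds a match.
SALT_BUCKETS = 8
salted_clicks = clicks.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"),
                (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)
salted_lookup = (
    lookup.crossJoin(spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt"))
    .withColumn("salted_key",
                F.concat_ws("_", F.col("key"), F.col("salt").cast("string")))
)

joined = salted_clicks.join(salted_lookup, "salted_key")
```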

In each of the following articles, you can find information on different aspects of Spark optimization.

Next steps