
What are mapping data flows?

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop graphical data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Spark clusters. Data flow activities can be operationalized via existing Data Factory scheduling, control-flow, and monitoring capabilities.

Mapping data flows provide a fully visual experience with no coding required. Your data flows run on your own execution cluster for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.

Getting started

To create a data flow, select the plus sign under Factory Resources, and then select Data Flow.


This takes you to the data flow canvas, where you can create your transformation logic. Select Add source to start configuring your source transformation. For more information, see Source transformation.

Data flow canvas

The data flow canvas is separated into three parts: the top bar, the graph, and the configuration panel.


Graph

The graph displays the transformation stream. It shows the lineage of source data as it flows into one or more sinks. To add a new source, select Add source. To add a new transformation, select the plus sign on the lower right of an existing transformation.


Azure integration runtime data flow properties


When you begin working with data flows in ADF, turn on the Debug switch for data flows at the top of the browser UI. This spins up an Azure Databricks cluster to use for interactive debugging, data previews, and pipeline debug executions. You can set the size of the cluster being used by choosing a custom Azure Integration Runtime. The debug session stays alive for up to 60 minutes after your last data preview or last debug pipeline execution.

When you operationalize your pipelines with data flow activities, ADF uses the Azure Integration Runtime associated with the activity in the Run On property.

The default Azure Integration Runtime is a small four-core, single-worker-node cluster intended to let you preview data and quickly execute debug pipelines at minimal cost. Set a larger Azure IR configuration if you're performing operations against large datasets.

You can instruct ADF to maintain a pool of cluster resources (VMs) by setting a TTL in the Azure IR data flow properties. This results in faster job execution on subsequent activities.
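The effect of a TTL can be sketched with a toy warm-pool model. This is illustrative Python only, not an ADF API: a cluster released within the TTL window is reused instead of cold-started.

```python
import time

class WarmClusterPool:
    """Toy model of TTL-based compute reuse (illustrative only, not an ADF API)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.released_at = None  # when the last cluster was returned to the pool

    def acquire(self):
        # Reuse the warm cluster if it was released within the TTL window.
        if self.released_at is not None and time.time() - self.released_at < self.ttl:
            return "reused warm cluster"
        return "cold start: provisioning new cluster"

    def release(self):
        self.released_at = time.time()

pool = WarmClusterPool(ttl_seconds=600)
first = pool.acquire()   # nothing warm yet -> cold start
pool.release()
second = pool.acquire()  # released moments ago, well within the TTL -> reused
```

The point of the model: the first activity in a pipeline pays the provisioning cost, and subsequent activities that start within the TTL do not.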

Azure integration runtime and data flow strategies

Execute data flows in parallel

If you execute data flows in a pipeline in parallel, ADF spins up separate Azure Databricks clusters for each activity execution, based on the settings in the Azure Integration Runtime attached to each activity. To design parallel executions in ADF pipelines, add your data flow activities without precedence constraints in the UI.

Of these three options, this one will likely execute in the shortest amount of time. However, each parallel data flow executes at the same time on a separate cluster, so the ordering of events is non-deterministic.
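The non-deterministic completion order can be illustrated with a small thread-based sketch, with Python threads standing in for separate clusters (the flow names are made up):

```python
import random
import threading
import time

finished = []
lock = threading.Lock()

def run_data_flow(name):
    # Each "data flow" runs independently, like activities on separate clusters.
    time.sleep(random.random() * 0.05)
    with lock:
        finished.append(name)

threads = [threading.Thread(target=run_data_flow, args=(f"flow{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three flows complete, but their completion order varies from run to run.
print(finished)
```

If any downstream step depends on one flow finishing before another, parallel execution is the wrong choice; use precedence constraints instead.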

Overload single data flow

If you put all of your logic inside a single data flow, ADF executes it all in the same job execution context on a single Spark cluster instance.

This option can be more difficult to follow and troubleshoot because your business rules and business logic will be jumbled together. It also doesn't provide much reusability.

Execute data flows serially

If you execute your data flow activities serially in the pipeline and you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs), resulting in faster subsequent execution times. You still receive a new Spark context for each execution.

Of these three options, this one will likely take the longest time to execute end to end. But it does provide a clean separation of logical operations in each data flow step.

Configuration panel

The configuration panel shows the settings specific to the currently selected transformation. If no transformation is selected, it shows the data flow. In the overall data flow configuration, you can edit the name and description under the General tab or add parameters via the Parameters tab. For more information, see Mapping data flow parameters.

Each transformation has at least four configuration tabs.

Transformation settings

The first tab in each transformation's configuration pane contains the settings specific to that transformation. For more information, see that transformation's documentation page.


Optimize

The Optimize tab contains settings to configure partitioning schemes.


The default setting is Use current partitioning, which instructs Azure Data Factory to use the partitioning scheme native to data flows running on Spark. In most scenarios, we recommend this setting.

There are instances where you might want to adjust the partitioning. For instance, if you want to output your transformations to a single file in the lake, select Single partition in a sink transformation.

Another case where you might want to control the partitioning schemes is optimizing performance. Adjusting the partitioning provides control over the distribution of your data across compute nodes and over data locality optimizations, which can have both positive and negative effects on your overall data flow performance. For more information, see the Data flow performance guide.

To change the partitioning on any transformation, select the Optimize tab and select the Set Partitioning radio button. You'll then be presented with a series of options for partitioning. The best method of partitioning differs based on your data volumes, candidate keys, null values, and cardinality.

A best practice is to start with default partitioning and then try different partitioning options. You can test by using pipeline debug runs, and view execution time and partition usage in each transformation grouping from the monitoring view. For more information, see Monitoring data flows.

The following partitioning options are available.

Round robin

Round robin is a simple partitioning scheme that automatically distributes data equally across partitions. Use round robin when you don't have good key candidates to implement a solid, smart partitioning strategy. You can set the number of physical partitions.
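Conceptually, round robin is just modular assignment by row position; a minimal sketch:

```python
def round_robin_partition(rows, num_partitions):
    """Distribute rows evenly across partitions by position, ignoring values."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
# Partition sizes differ by at most one row: [4, 3, 3]
```

Because assignment ignores the data's values, round robin gives even sizes but no guarantee that related rows end up together.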

Hash

Azure Data Factory produces a hash of columns to produce uniform partitions, such that rows with similar values fall in the same partition. When you use the Hash option, test for possible partition skew. You can set the number of physical partitions.
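Hash partitioning can be sketched as hashing the key column and taking it modulo the partition count, so equal keys always land together. This is illustrative Python, not ADF's actual hash function:

```python
import hashlib

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by a stable hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_partitions
        partitions[bucket].append(row)
    return partitions

rows = [{"region": "east"}, {"region": "west"}, {"region": "east"}]
parts = hash_partition(rows, "region", 4)
# Both "east" rows are guaranteed to share a partition.
```

This also shows where skew comes from: if one key value dominates the data, its partition gets most of the rows.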

Dynamic range

Dynamic range uses Spark dynamic ranges based on the columns or expressions that you provide. You can set the number of physical partitions.
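A range partitioner derives boundaries from the data itself and then routes each value to the range it falls in. A simplified sketch follows; real Spark samples the data to estimate boundaries rather than fully sorting it:

```python
def range_partition(values, num_partitions):
    """Split values into contiguous ranges with boundaries taken from the data."""
    s = sorted(values)
    n = len(s)
    # Upper boundary of each partition except the last.
    bounds = [s[(i + 1) * n // num_partitions - 1] for i in range(num_partitions - 1)]
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        idx = sum(1 for b in bounds if v > b)
        partitions[idx].append(v)
    return partitions

parts = range_partition([9, 1, 5, 3, 7, 2, 8, 4, 6], 3)
# Contiguous ranges: [1, 2, 3], [4, 5, 6], [7, 8, 9]
```

Because the boundaries come from the observed data, the ranges adapt to its distribution, which is what makes this "dynamic" compared with fixed range below.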

Fixed range

Build an expression that provides a fixed range for values within your partitioned data columns. To avoid partition skew, you should have a good understanding of your data before you use this option. The values you enter for the expression are used as part of a partition function. You can set the number of physical partitions.
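The idea can be sketched as user-supplied boundary values forming the partition function. This is a hypothetical Python example; in ADF you express the boundaries in the data flow expression language:

```python
def fixed_range_partition(rows, column, bounds):
    """Route rows by fixed, user-chosen boundaries; len(bounds) + 1 partitions."""
    partitions = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        idx = sum(1 for b in bounds if row[column] >= b)
        partitions[idx].append(row)
    return partitions

rows = [{"amount": 50}, {"amount": 500}, {"amount": 5000}]
# Partitions: amount < 100, 100 <= amount < 1000, amount >= 1000
parts = fixed_range_partition(rows, "amount", bounds=[100, 1000])
```

If the chosen boundaries don't match the data's actual distribution, some partitions end up much larger than others, which is exactly the skew the paragraph above warns about.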

Key

If you have a good understanding of the cardinality of your data, key partitioning might be a good strategy. Key partitioning creates a partition for each unique value in your column. You can't set the number of partitions, because it's based on the unique values in the data.
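Key partitioning amounts to grouping rows by the distinct values of the chosen column, so the partition count follows the data; a minimal sketch:

```python
from collections import defaultdict

def key_partition(rows, key):
    """One partition per distinct key value; partition count is data-driven."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

rows = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
parts = key_partition(rows, "country")
# Two distinct values -> two partitions; you can't choose the count.
```

This is why cardinality matters: a column with millions of distinct values would yield millions of tiny partitions.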

Inspect

The Inspect tab provides a view into the metadata of the data stream that you're transforming. You can see the column counts, columns changed, columns added, data types, column ordering, and column references. Inspect is a read-only view of your metadata. You don't need to have debug mode enabled to see metadata in the Inspect pane.


As you change the shape of your data through transformations, you'll see the metadata changes flow in the Inspect pane. If there isn't a defined schema in your source transformation, metadata won't be visible in the Inspect pane. Lack of metadata is common in schema drift scenarios.

Data preview

If debug mode is on, the Data Preview tab gives you an interactive snapshot of the data at each transform. For more information, see Data preview in debug mode.

Top bar

The top bar contains actions that affect the whole data flow, like saving and validation. You can also toggle between graph and configuration modes by using the Show Graph and Hide Graph buttons.


If you hide your graph, you can browse through your transformation nodes laterally via the Previous and Next buttons.


Next steps