
What are mapping data flows?

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic graphically, without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized through existing Data Factory scheduling, control flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on your execution cluster for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.

Architecture

Getting started

To create a data flow, select the plus sign under Factory Resources, and then select Data Flow.

New data flow

This action takes you to the data flow canvas, where you can create your transformation logic. Select Add source to start configuring your source transformation. For more information, see Source transformation.

Data flow canvas

The data flow canvas is separated into three parts: the top bar, the graph, and the configuration panel.

Canvas

Graph

The graph displays the transformation stream. It shows the lineage of source data as it flows into one or more sinks. To add a new source, select Add source. To add a new transformation, select the plus sign on the lower right of an existing transformation.


Azure integration runtime data flow properties

Debug button

When you begin working with data flows in ADF, turn on the Debug switch for data flows at the top of the browser UI. This spins up a Spark cluster to use for interactive debugging, data previews, and pipeline debug executions. You can set the size of the cluster by choosing a custom Azure Integration Runtime. The debug session stays alive for up to 60 minutes after your last data preview or last debug pipeline execution.

When you operationalize your pipelines with data flow activities, ADF uses the Azure Integration Runtime associated with the activity in the Run On property.

The default Azure Integration Runtime is a small 4-core single worker node cluster that allows you to preview data and quickly execute debug pipelines at minimal cost. Set a larger Azure IR configuration if you are performing operations against large datasets.

You can instruct ADF to maintain a pool of cluster resources (VMs) by setting a time to live (TTL) in the Azure IR data flow properties. This results in faster job execution on subsequent activities.
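As a rough sketch of where the cluster size and TTL settings live, the Python below builds an Azure IR definition as a plain dictionary and prints it as JSON. The property names (`computeProperties`, `dataFlowProperties`, `computeType`, `coreCount`, `timeToLive`) follow the commonly documented JSON shape for a managed integration runtime, but treat them as assumptions and check them against the current schema. The function name `build_azure_ir` and the runtime name `DataFlowIR` are hypothetical.

```python
import json


def build_azure_ir(name, core_count=8, ttl_minutes=10):
    """Return a dict describing an Azure IR with data flow properties.

    The nested property names are an assumed shape, not a verified schema.
    """
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": "General",
                        "coreCount": core_count,   # size of the Spark cluster
                        "timeToLive": ttl_minutes,  # minutes to keep the pool warm
                    },
                }
            },
        },
    }


# Hypothetical runtime: a larger cluster kept warm for 15 minutes.
ir_definition = build_azure_ir("DataFlowIR", core_count=16, ttl_minutes=15)
print(json.dumps(ir_definition, indent=2))
```

A longer TTL trades idle compute cost for faster startup of subsequent data flow activities.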

Azure integration runtime and data flow strategies

Execute data flows in parallel

If you execute data flows in a pipeline in parallel, ADF spins up a separate Spark cluster for each activity execution, based on the settings in the Azure Integration Runtime attached to each activity. To design parallel executions in ADF pipelines, add your data flow activities without precedence constraints in the UI.

Of the three options described in this section, this one likely executes in the shortest amount of time. However, each parallel data flow executes at the same time on separate clusters, so the ordering of events is non-deterministic.

If you execute your data flow activities in parallel inside your pipelines, we recommend not using TTL. Parallel executions of data flows that simultaneously use the same Azure Integration Runtime result in multiple warm pool instances for your data factory.

Overload single data flow

If you put all of your logic inside a single data flow, ADF executes it in a single job execution context on a single Spark cluster instance.

This option can be more challenging to follow and troubleshoot because your business rules and business logic can be jumbled together. This option also doesn't provide much reusability.

Execute data flows sequentially

If you execute your data flow activities in sequence in the pipeline and you have set a TTL on the Azure IR configuration, ADF reuses the compute resources (VMs), resulting in faster subsequent execution times. You still receive a new Spark context for each execution.

Of these three options, this one likely takes the longest to execute end to end. However, it provides a clean separation of logical operations in each data flow step.
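The difference between the parallel and sequential strategies comes down to whether the data flow activities carry precedence constraints. The following Python is a minimal sketch that builds the activity list of a pipeline definition as dictionaries. The `ExecuteDataFlow` activity type and the `dependsOn` / `dependencyConditions` fields reflect the usual pipeline JSON shape, but they are stated here as assumptions. The helper names and the data flow names `LoadCustomers` and `LoadOrders` are hypothetical.

```python
def data_flow_activity(flow_name, depends_on=None):
    """Build one data flow activity; depends_on lists upstream activity names."""
    return {
        "name": f"Run_{flow_name}",
        "type": "ExecuteDataFlow",
        # An empty list means no precedence constraint on this activity.
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in (depends_on or [])
        ],
        "typeProperties": {
            "dataFlow": {"referenceName": flow_name, "type": "DataFlowReference"}
        },
    }


def parallel_activities(flow_names):
    """No precedence constraints: the activities can start at the same time."""
    return [data_flow_activity(name) for name in flow_names]


def sequential_activities(flow_names):
    """Each activity waits for the previous one to succeed."""
    activities = []
    previous = None
    for name in flow_names:
        activity = data_flow_activity(name, [previous] if previous else None)
        activities.append(activity)
        previous = activity["name"]
    return activities


parallel = parallel_activities(["LoadCustomers", "LoadOrders"])
sequential = sequential_activities(["LoadCustomers", "LoadOrders"])
print([a["dependsOn"] for a in parallel])
print([a["dependsOn"] for a in sequential])
```

In the sequential variant, a TTL on the Azure IR lets the second activity reuse the warm compute from the first. In the parallel variant, each activity gets its own cluster.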

Configuration panel

The configuration panel shows the settings specific to the currently selected transformation. If no transformation is selected, it shows the settings for the data flow as a whole. In the overall data flow configuration, you can edit the name and description under the General tab or add parameters via the Parameters tab. For more information, see Mapping data flow parameters.

Each transformation contains at least four configuration tabs.

Transformation settings

The first tab in each transformation's configuration pane contains the settings specific to that transformation. For more information, see that transformation's documentation page.

Source settings tab

Optimize

The Optimize tab contains settings to configure partitioning schemes.


The default setting is Use current partitioning, which instructs Azure Data Factory to use the partitioning scheme native to data flows running on Spark. In most scenarios, we recommend this setting.

There are instances where you might want to adjust the partitioning. For instance, if you want to output your transformations to a single file in the lake, select Single partition in a sink transformation.

Another case where you might want to control the partitioning scheme is performance optimization. Adjusting the partitioning gives you control over how your data is distributed across compute nodes and over data locality optimizations, which can have both positive and negative effects on overall data flow performance. For more information, see the Data flow performance guide.

To change the partitioning on any transformation, select the Optimize tab and select the Set Partitioning radio button. You are presented with a series of options for partitioning. The best method of partitioning differs based on your data volumes, candidate keys, null values, and cardinality.

A best practice is to start with default partitioning and then try different partitioning options. You can test by using pipeline debug runs, and view execution time and partition usage in each transformation grouping from the monitoring view. For more information, see Monitoring data flows.

The following partitioning options are available.

Round robin

Round robin is a simple partitioning scheme that automatically distributes data equally across partitions. Use round robin when you don't have good key candidates to implement a solid, smart partitioning strategy. You can set the number of physical partitions.
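To illustrate the idea, here is a conceptual sketch of round-robin distribution in plain Python. It is not how Spark or Azure Data Factory implements the option internally; the function name `round_robin_partition` is hypothetical and the rows are simple integers.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


rows = list(range(10))
round_robin_parts = round_robin_partition(rows, 3)
round_robin_sizes = [len(part) for part in round_robin_parts]
print(round_robin_sizes)  # partition sizes differ by at most one row
```

Because assignment ignores the row contents, the partition sizes stay balanced regardless of how the values are distributed.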

Hash

Azure Data Factory produces a hash of columns to produce uniform partitions such that rows with similar values fall in the same partition. When you use the Hash option, test for possible partition skew. You can set the number of physical partitions.
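The sketch below shows, in plain Python, why hash partitioning can skew: every row with the same key value lands in the same partition, so a dominant value produces one oversized partition. This is a conceptual illustration under assumed names (`hash_partition`, `skew_ratio`, and the `country` column are hypothetical) and uses an MD5 digest only to get a stable hash; it does not reproduce the hash function the service uses.

```python
import hashlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on a stable hash of row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = hashlib.md5(str(row[key]).encode("utf-8")).hexdigest()
        partitions[int(digest, 16) % num_partitions].append(row)
    return partitions


def skew_ratio(partitions):
    """Largest partition size divided by the mean size (1.0 means balanced)."""
    sizes = [len(part) for part in partitions]
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical data: 90 rows share one country, so one partition dominates.
skewed_rows = [{"country": "US"}] * 90 + [{"country": c} for c in "ABCDEFGHIJ"]
hash_parts = hash_partition(skewed_rows, "country", 4)
print([len(part) for part in hash_parts], round(skew_ratio(hash_parts), 2))
```

A ratio well above 1.0 is a signal to try a different key or a different partitioning option.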

Dynamic range

The dynamic range uses Spark dynamic ranges based on the columns or expressions that you provide. You can set the number of physical partitions.

Fixed range

Build an expression that provides a fixed range for values within your partitioned data columns. To avoid partition skew, you should have a good understanding of your data before you use this option. The values you enter for the expression are used as part of a partition function. You can set the number of physical partitions.
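As a conceptual illustration of range-based assignment, the following Python sketch places rows into partitions according to fixed boundary values. It does not use the data flow expression language and is not the service's partition function; the function name `fixed_range_partition`, the `age` column, and the boundaries are all hypothetical.

```python
import bisect


def fixed_range_partition(rows, key, boundaries):
    """Place each row in the range its key value falls into.

    boundaries must be sorted; a value equal to a boundary goes to the upper range.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row[key])].append(row)
    return partitions


people = [{"age": age} for age in (5, 17, 18, 30, 64, 65, 90)]
# Ranges: below 18, 18 up to but not including 65, and 65 or above.
range_parts = fixed_range_partition(people, "age", [18, 65])
range_sizes = [len(part) for part in range_parts]
print(range_sizes)
```

If most values fall inside one range, that partition grows disproportionately, which is why knowing the data distribution matters before choosing the boundaries.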

Key

If you have a good understanding of the cardinality of your data, key partitioning might be a good strategy. Key partitioning creates partitions for each unique value in your column. You can't set the number of partitions because the number is based on unique values in the data.
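The following plain-Python sketch shows why the partition count cannot be set for key partitioning: it is determined entirely by the number of distinct values in the key column. The function name `key_partition` and the `region` column are hypothetical, and the code is a conceptual illustration rather than the service's implementation.

```python
from collections import defaultdict


def key_partition(rows, key):
    """Group rows so that each distinct key value gets its own partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


orders = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 30},
    {"region": "north", "amount": 40},
]
key_parts = key_partition(orders, "region")
print({region: len(part) for region, part in key_parts.items()})
```

A high-cardinality column would produce a very large number of small partitions, so this option suits columns with a modest number of distinct values.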

Inspect

The Inspect tab provides a view into the metadata of the data stream that you're transforming. You can see column counts, the columns changed, the columns added, data types, the column order, and column references. Inspect is a read-only view of your metadata. You don't need to have debug mode enabled to see metadata in the Inspect pane.


As you change the shape of your data through transformations, you'll see the metadata changes flow in the Inspect pane. If there isn't a defined schema in your source transformation, then metadata won't be visible in the Inspect pane. Lack of metadata is common in schema drift scenarios.

Data preview

If debug mode is on, the Data Preview tab gives you an interactive snapshot of the data at each transform. For more information, see Data preview in debug mode.

Top bar

The top bar contains actions that affect the whole data flow, like saving and validation. You can also toggle between graph and configuration modes by using the Show Graph and Hide Graph buttons.

Hide graph

If you hide your graph, you can browse through your transformation nodes laterally via the Previous and Next buttons.

Previous and next buttons

Next steps