
Copy data to or from Azure Blob Storage using Azure Data Factory

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Azure Blob Storage connector in V2.

This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.

Overview

You can copy data from any supported source data store to Azure Blob Storage, or from Azure Blob Storage to any supported sink data store. The following table provides a list of data stores supported as sources or sinks by the copy activity. For example, you can move data from a SQL Server database or a database in Azure SQL Database to Azure Blob storage. And, you can copy data from Azure Blob storage to Azure Synapse Analytics or an Azure Cosmos DB collection.

Note

This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

Supported scenarios

You can copy data from Azure Blob Storage to the following data stores:

Category    Data store
Azure       Azure Blob storage
            Azure Data Lake Storage Gen1
            Azure Cosmos DB (SQL API)
            Azure SQL Database
            Azure Synapse Analytics
            Azure Cognitive Search Index
            Azure Table storage
Databases   SQL Server
            Oracle
File        File system

You can copy data from the following data stores to Azure Blob Storage:

Category    Data store
Azure       Azure Blob storage
            Azure Cosmos DB (SQL API)
            Azure Data Lake Storage Gen1
            Azure SQL Database
            Azure Synapse Analytics
            Azure Table storage
Databases   Amazon Redshift
            DB2
            MySQL
            Oracle
            PostgreSQL
            SAP Business Warehouse
            SAP HANA
            SQL Server
            Sybase
            Teradata
NoSQL       Cassandra
            MongoDB
File        Amazon S3
            File system
            FTP
            HDFS
            SFTP
Others      Generic HTTP
            Generic OData
            Generic ODBC
            Salesforce
            Web table (table from HTML)

Important

Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob storage. The activity supports reading from block, append, or page blobs, but supports writing only to block blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs.

Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the pipeline. For an example, see the Delete blob or folder sample on GitHub.

Get started

You can create a pipeline with a copy activity that moves data to/from Azure Blob Storage by using different tools/APIs.

The easiest way to create a pipeline is to use the Copy Wizard. This article has a walkthrough for creating a pipeline to copy data from one Azure Blob Storage location to another Azure Blob Storage location. For a tutorial on creating a pipeline to copy data from Azure Blob Storage to Azure SQL Database, see Tutorial: Create a pipeline using Copy Wizard.

You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See the Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.

Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:

  1. Create a data factory. A data factory may contain one or more pipelines.
  2. Create linked services to link input and output data stores to your data factory. For example, if you are copying data from Azure blob storage to Azure SQL Database, you create two linked services to link your Azure storage account and Azure SQL Database to your data factory. For linked service properties that are specific to Azure Blob Storage, see the linked service properties section.
  3. Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in Azure SQL Database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see the dataset properties section.
  4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see the copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.

When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Blob Storage, see the JSON examples section of this article.

The following sections provide details about the JSON properties that are used to define Data Factory entities specific to Azure Blob Storage.

Linked service properties

There are two types of linked services you can use to link Azure Storage to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage account, whereas the Azure Storage SAS (Shared Access Signature) linked service provides the data factory with restricted/time-bound access to the Azure Storage account. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.

Azure Storage Linked Service

The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by using the account key, which provides the data factory with global access to the Azure Storage. The following table provides descriptions for the JSON elements specific to the Azure Storage linked service.

Property            Description                                                                                     Required
type                The type property must be set to: AzureStorage                                                 Yes
connectionString    Specify the information needed to connect to Azure storage for the connectionString property.  Yes

For information about how to retrieve the storage account access keys, see Manage storage account access keys.

Example:

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

Azure Storage SAS Linked Service

A shared access signature (SAS) provides delegated access to resources in your storage account. It allows you to grant a client limited permissions to objects in your storage account for a specified period of time and with a specified set of permissions, without having to share your account access keys. The SAS is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the appropriate constructor or method. For more information about SAS, see Grant limited access to Azure Storage resources using shared access signatures (SAS).

Important

Azure Data Factory now supports only Service SAS, not Account SAS. Note that the SAS URL that can be generated from the Azure portal or Storage Explorer is an Account SAS, which is not supported.

Tip

You can execute the following PowerShell commands to generate a Service SAS for your storage account (replace the placeholders and grant the needed permissions):

$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime <startTime> -ExpiryTime <endTime> -FullUri

The Azure Storage SAS linked service allows you to link an Azure Storage account to an Azure data factory by using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to all/specific resources (blob/container) in the storage. The following table provides descriptions for the JSON elements specific to the Azure Storage SAS linked service.

Property    Description                                                                                                     Required
type        The type property must be set to: AzureStorageSas                                                              Yes
sasUri      Specify the Shared Access Signature URI to the Azure Storage resource, such as a blob, container, or table.    Yes

Example:

{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "<Specify SAS URI of the Azure Storage resource>"
        }
    }
}

When creating a SAS URI, consider the following:

  • Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is used in your data factory.
  • Set the expiry time appropriately. Make sure that the access to Azure Storage objects does not expire within the active period of the pipeline.
  • The URI should be created at the right container/blob or table level based on the need. A SAS URI to an Azure blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container allows the Data Factory service to iterate through blobs in that container. If you need to provide access to more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI.

Dataset properties

To specify a dataset to represent input or output data in Azure Blob Storage, you set the type property of the dataset to AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container and the folder in the blob storage.

For a full list of JSON sections and properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, and so on).

Data Factory supports the following CLS-compliant .NET-based type values for providing type information in "structure" for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store.
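
As an illustration, a dataset structure section using these type values might look like the following sketch (the column names shown are hypothetical):

"structure":
[
    { "name": "userid", "type": "Int64" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime" }
],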

The typeProperties section is different for each type of dataset and provides information about the location, format, and so on, of the data in the data store. The typeProperties section for a dataset of type AzureBlob has the following properties:

Property        Description     Required
folderPath      Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\   Yes
fileName        Name of the blob. fileName is optional and case-sensitive.

                If you specify a fileName, the activity (including Copy) works on the specific blob.

                When fileName is not specified, Copy includes all blobs in the folderPath for the input dataset.

                When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt).
                No
partitionedBy   partitionedBy is an optional property. You can use it to specify a dynamic folderPath and fileName for time series data. For example, folderPath can be parameterized for every hour of data. See the Using partitionedBy property section for details and examples.   No
format          The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections.

                If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions.
                No
compression     Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory.  No
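
To illustrate how these properties fit together, the typeProperties section of an AzureBlob dataset that reads GZip-compressed text files might look like the following sketch (the container, folder, and file names are placeholders):

"typeProperties": {
    "folderPath": "myblobcontainer/myblobfolder/",
    "fileName": "input.csv.gz",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}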

Using partitionedBy property

As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time series data with the partitionedBy property, Data Factory functions, and the system variables.

For more information on time series datasets, scheduling, and slices, see the Creating Datasets and Scheduling & Execution articles.

Sample 1

"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
    { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (YYYYMMDDHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.

Sample 2

"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
    { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
    { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
    { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
    { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties.

Copy activity properties

For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities. Whereas, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. If you are moving data from Azure Blob Storage, you set the source type in the copy activity to BlobSource. Similarly, if you are moving data to Azure Blob Storage, you set the sink type in the copy activity to BlobSink. This section provides a list of properties supported by BlobSource and BlobSink; an illustrative snippet follows the tables.

BlobSource supports the following properties in the typeProperties section:

Property    Description                                                                                             Allowed values                  Required
recursive   Indicates whether the data is read recursively from the sub folders or only from the specified folder.  True (default value), False     No

BlobSink supports the following properties in the typeProperties section:

Property        Description                                                             Allowed values      Required
copyBehavior    Defines the copy behavior when the source is BlobSource or FileSystem. PreserveHierarchy: preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.

FlattenHierarchy: all files from the source folder are placed in the first level of the target folder. The target files have auto-generated names.

MergeFiles: merges all files from the source folder into one file. If the file/blob name is specified, the merged file name is the specified name; otherwise, it is an auto-generated file name.
No
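
As a quick illustration of how these source and sink properties appear together, the typeProperties of a copy activity might be shaped like the following sketch (the values shown are illustrative, not required settings):

"typeProperties": {
    "source": {
        "type": "BlobSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "PreserveHierarchy"
    }
}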

BlobSource also supports these two properties for backward compatibility.

  • treatEmptyAsNull: Specifies whether to treat a null or empty string as a null value.
  • skipHeaderLineCount: Specifies how many lines need to be skipped. It is applicable only when the input dataset uses TextFormat.

Similarly, BlobSink supports the following property for backward compatibility.

  • blobWriterAddHeader: Specifies whether to add a header of column definitions while writing to an output dataset.

Datasets now support the following properties that implement the same functionality: treatEmptyAsNull, skipLineCount, firstRowAsHeader.

The following table provides guidance on using the new dataset properties in place of these blob source/sink properties.

Copy Activity property              Dataset property
skipHeaderLineCount on BlobSource   skipLineCount and firstRowAsHeader. Lines are skipped first and then the first row is read as a header.
treatEmptyAsNull on BlobSource      treatEmptyAsNull on the input dataset
blobWriterAddHeader on BlobSink     firstRowAsHeader on the output dataset

See the Specifying TextFormat section for detailed information on these properties.
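
For illustration, the equivalent dataset-level properties are set in the format section of a dataset that uses TextFormat, roughly as in this sketch (the values shown are examples):

"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 2,
    "firstRowAsHeader": true,
    "treatEmptyAsNull": true
}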

recursive and copyBehavior examples

This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values.

recursive   copyBehavior        Resulting behavior
true        preserveHierarchy   For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the same structure as the source:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5
true        flattenHierarchy    For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target Folder1 is created with the following structure:

Folder1
    auto-generated name for File1
    auto-generated name for File2
    auto-generated name for File3
    auto-generated name for File4
    auto-generated name for File5
true        mergeFiles          For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target Folder1 is created with the following structure:

Folder1
    File1 + File2 + File3 + File4 + File5 contents are merged into one file with an auto-generated file name
false       preserveHierarchy   For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    File1
    File2

Subfolder1 with File3, File4, and File5 is not picked up.
false       flattenHierarchy    For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    auto-generated name for File1
    auto-generated name for File2

Subfolder1 with File3, File4, and File5 is not picked up.
false       mergeFiles          For a source folder Folder1 with the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

the target folder Folder1 is created with the following structure:

Folder1
    File1 + File2 contents are merged into one file with an auto-generated file name (auto-generated name for File1)

Subfolder1 with File3, File4, and File5 is not picked up.

Walkthrough: Use Copy Wizard to copy data to/from Blob Storage

Let's look at how to quickly copy data to/from Azure blob storage. In this walkthrough, both the source and destination data stores are of type Azure Blob Storage. The pipeline in this walkthrough copies data from a folder to another folder in the same blob container. This walkthrough is intentionally simple to show you the settings or properties used when Blob Storage is a source or sink.

Prerequisites

  1. Create a general-purpose Azure Storage account if you don't have one already. You use the blob storage as both the source and destination data store in this walkthrough. If you don't have an Azure storage account, see the Create a storage account article for steps to create one.
  2. Create a blob container named adfblobconnector in the storage account.
  3. Create a folder named input in the adfblobconnector container.
  4. Create a file named emp.txt with the following content and upload it to the input folder by using tools such as Azure Storage Explorer:
    John, Doe
    Jane, Doe
    

Create the data factory

  1. Sign in to the Azure portal.
  2. Click Create a resource from the top-left corner, click Intelligence + analytics, and click Data Factory.
  3. In the New data factory pane:
    1. Enter ADFBlobConnectorDF for the name. The name of the Azure data factory must be globally unique. If you receive the error: Data factory name "ADFBlobConnectorDF" is not available, change the name of the data factory (for example, yournameADFBlobConnectorDF) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
    2. Select your Azure subscription.
    3. For Resource Group, select Use existing to select an existing resource group (or) select Create new to enter a name for a resource group.
    4. Select a location for the data factory.
    5. Select the Pin to dashboard check box at the bottom of the blade.
    6. Click Create.
  4. After the creation is complete, you see the Data Factory blade as shown in the following image: Data factory home page

Copy Wizard

  1. On the Data Factory home page, click the Copy data tile to launch the Copy Data Wizard in a separate tab.

    Note

    If you see that the web browser is stuck at "Authorizing...", disable/uncheck the Block third-party cookies and site data setting (or) keep it enabled and create an exception for login.microsoftonline.com, and then try launching the wizard again.

  2. In the Properties page:

    1. Enter CopyPipeline for Task name. The task name is the name of the pipeline in your data factory.
    2. Enter a description for the task (optional).
    3. For Task cadence or Task schedule, keep the Run regularly on schedule option. If you want to run this task only once instead of running repeatedly on a schedule, select Run once now. If you select the Run once now option, a one-time pipeline is created.
    4. Keep the settings for Recurring pattern. This task runs daily between the start and end times you specify in the next step.
    5. Change the Start date time to 04/21/2017.
    6. Change the End date time to 04/25/2017. You may want to type the date instead of browsing through the calendar.
    7. Click Next. Copy Tool - Properties page
  3. On the Source data store page, click the Azure Blob Storage tile. You use this page to specify the source data store for the copy task. You can use an existing data store linked service (or) specify a new data store. To use an existing linked service, you would select FROM EXISTING LINKED SERVICES and select the right linked service. Copy Tool - Source data store page

  4. On the Specify the Azure Blob storage account page:

    1. Keep the auto-generated name for Connection name. The connection name is the name of the linked service of type Azure Storage.
    2. Confirm that the From Azure subscriptions option is selected for Account selection method.
    3. Select your Azure subscription or keep Select all for Azure subscription.
    4. Select an Azure storage account from the list of Azure storage accounts available in the selected subscription. You can also choose to enter the storage account settings manually by selecting the Enter manually option for the Account selection method.
    5. Click Next.
      Copy Tool - Specify the Azure Blob storage account
  5. On the Choose the input file or folder page:

    1. Double-click adfblobconnector.
    2. Select input, and click Choose. In this walkthrough, you select the input folder. You could also select the emp.txt file in the folder instead. Copy Tool - Choose the input file or folder 1
  6. On the Choose the input file or folder page:

    1. Confirm that the file or folder is set to adfblobconnector/input. If the files are in sub folders, for example, 2017/04/01, 2017/04/02, and so on, enter adfblobconnector/input/{year}/{month}/{day} for the file or folder. When you press TAB out of the text box, you see three drop-down lists to select formats for year (yyyy), month (MM), and day (dd).
    2. Do not set Copy file recursively. Select this option to recursively traverse through folders for files to be copied to the destination.
    3. Do not select the Binary copy option. Select this option to perform a binary copy of the source file to the destination. Do not select it for this walkthrough so that you can see more options in the next pages.
    4. Confirm that the Compression type is set to None. Select a value for this option if your source files are compressed in one of the supported formats.
    5. Click Next. Copy Tool - Choose the input file or folder 2
  7. On the File format settings page, you see the delimiters and the schema that are auto-detected by the wizard by parsing the file.

    1. Confirm the following options:
      a. The file format is set to Text format. You can see all the supported formats in the drop-down list. For example: JSON, Avro, ORC, Parquet.
      b. The column delimiter is set to Comma (,). You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
      c. The row delimiter is set to Carriage Return + Line feed (\r\n). You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
      d. The skip line count is set to 0. If you want a few lines to be skipped at the top of the file, enter the number here.
      e. The first data row contains column names option is not set. If the source files contain column names in the first row, select this option.
      f. The treat empty column value as null option is set.
    2. Expand Advanced settings to see the advanced options available.
    3. At the bottom of the page, see the preview of data from the emp.txt file.
    4. Click the SCHEMA tab at the bottom to see the schema that the copy wizard inferred by looking at the data in the source file.
    5. Click Next after you review the delimiters and preview data. Copy Tool - File format settings
  8. On the Destination data store page, select Azure Blob Storage, and click Next. You are using Azure Blob Storage as both the source and destination data store in this walkthrough.
    Copy Tool - select destination data store

  9. On the Specify the Azure Blob storage account page:

    1. Enter AzureStorageLinkedService for the Connection name field.
    2. Confirm that the From Azure subscriptions option is selected for Account selection method.
    3. Select your Azure subscription.
    4. Select your Azure storage account.
    5. Click Next.
  10. On the Choose the output file or folder page:

    1. Specify the folder path as adfblobconnector/output/{year}/{month}/{day}. Press TAB.
    2. For the year, select yyyy.
    3. For the month, confirm that it is set to MM.
    4. For the day, confirm that it is set to dd.
    5. Confirm that the compression type is set to None.
    6. Confirm that the copy behavior is set to Merge files. If an output file with the same name already exists, the new content is added to the same file at the end.
    7. Click Next. Copy Tool - Choose output file or folder
  11. On the File format settings page, review the settings, and click Next. One of the additional options here is to add a header to the output file. If you select that option, a header row is added with the names of the columns from the schema of the source. You can rename the default column names when viewing the schema for the source. For example, you could change the first column to First Name and the second column to Last Name. Then, the output file is generated with a header that has these names as column names. Copy Tool - File format settings for destination

  12. On the Performance settings page, confirm that cloud units and parallel copies are set to Auto, and click Next. For details about these settings, see the Copy activity performance and tuning guide. Copy Tool - Performance settings

  13. On the Summary page, review all settings (task properties, settings for source and destination, and copy settings), and click Next. Copy Tool - Summary page

  14. Review the information in the Summary page, and click Finish. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard). Copy Tool - Deployment page

Monitor the pipeline (copy task)

  1. Click the link Click here to monitor copy pipeline on the Deployment page.
  2. You should see the Monitor and Manage application in a separate tab. Monitor and Manage App
  3. Change the start time at the top to 04/19/2017 and the end time to 04/27/2017, and then click Apply.
  4. You should see five activity windows in the ACTIVITY WINDOWS list. The WindowStart times should cover all days from the pipeline start to the pipeline end times.
  5. Click the Refresh button for the ACTIVITY WINDOWS list a few times until you see the status of all the activity windows set to Ready.
  6. Now, verify that the output files are generated in the output folder of the adfblobconnector container. You should see the following folder structure in the output folder:
    2017/04/21
    2017/04/22
    2017/04/23
    2017/04/24
    2017/04/25
    
    For detailed information about monitoring and managing data factories, see the Monitor and manage Data Factory pipeline article.

Data Factory entities

Now, switch back to the tab with the Data Factory home page. Notice that there are two linked services, two datasets, and one pipeline in your data factory now.

Data Factory home page with entities

Click Author and deploy to launch Data Factory Editor.

Data Factory Editor

You should see the following Data Factory entities in your data factory:

  • Two linked services. One for the source and the other one for the destination. Both linked services refer to the same Azure Storage account in this walkthrough.
  • Two datasets. An input dataset and an output dataset. In this walkthrough, both use the same blob container but refer to different folders (input and output).
  • A pipeline. The pipeline contains a copy activity that uses a blob source and a blob sink to copy data from an Azure blob location to another Azure blob location.

The following sections provide more information about these entities.

Linked services

You should see two linked services. One for the source and the other one for the destination. In this walkthrough, both definitions look the same except for the names. The type of the linked service is set to AzureStorage. The most important property of the linked service definition is the connectionString, which is used by Data Factory to connect to your Azure Storage account at runtime. Ignore the hubName property in the definition.

Source blob storage linked service
{
    "name": "Source-BlobStorage-z4y",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
        }
    }
}
Destination blob storage linked service
{
    "name": "Destination-BlobStorage-z4y",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
        }
    }
}

For more information about the Azure Storage linked service, see the Linked service properties section.

Datasets

There are two datasets: an input dataset and an output dataset. The type of the dataset is set to AzureBlob for both.

The input dataset points to the input folder of the adfblobconnector blob container. The external property is set to true for this dataset because the data is not produced by the pipeline with the copy activity that takes this dataset as an input.

The output dataset points to the output folder of the same blob container. The output dataset also uses the year, month, and day of the SliceStart system variable to dynamically evaluate the path for the output file. For a list of functions and system variables supported by Data Factory, see Data Factory functions and system variables. The external property is set to false (default value) because this dataset is produced by the pipeline.

For more information about the properties supported by the Azure Blob dataset, see the Dataset properties section.

Input dataset
{
    "name": "InputDataset-z4y",
    "properties": {
        "structure": [
            { "name": "Prop_0", "type": "String" },
            { "name": "Prop_1", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "Source-BlobStorage-z4y",
        "typeProperties": {
            "folderPath": "adfblobconnector/input/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
Output dataset
{
    "name": "OutputDataset-z4y",
    "properties": {
        "structure": [
            { "name": "Prop_0", "type": "String" },
            { "name": "Prop_1", "type": "String" }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "Destination-BlobStorage-z4y",
        "typeProperties": {
            "folderPath": "adfblobconnector/output/{year}/{month}/{day}",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "partitionedBy": [
                { "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}

Pipeline

The pipeline has just one activity. The type of the activity is set to Copy. In the type properties for the activity, there are two sections, one for the source and the other one for the sink. The source type is set to BlobSource because the activity copies data from blob storage. The sink type is set to BlobSink because the activity copies data to blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output.

For more information about the properties supported by BlobSource and BlobSink, see the Copy activity properties section.

{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": false
                    },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-z4y"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-z4y"
                    }
                ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y"
            }
        ],
        "start": "2017-04-21T22:34:00Z",
        "end": "2017-04-25T05:00:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}

JSON examples for copying data to and from Blob Storage

The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Blob Storage and Azure SQL Database. However, data can be copied directly from any of the sources to any of the sinks stated here by using the Copy Activity in Azure Data Factory.

JSON Example: Copy data from Blob Storage to SQL Database

The following sample shows:

  1. A linked service of type AzureSqlDatabase.
  2. A linked service of type AzureStorage.
  3. An input dataset of type AzureBlob.
  4. An output dataset of type AzureSqlTable.
  5. A pipeline with a Copy activity that uses BlobSource and SqlSink.

The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties used in these samples are described in sections following the samples.

Azure SQL linked service:

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}

Azure Storage linked service:

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter, you specify the Shared Access Signature (SAS) URI. See the Linked service properties section for details.

Azure Blob input dataset:

Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": "true" setting informs Data Factory that the table is external to the data factory and is not produced by an activity in the data factory.

{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
      "fileName": "{Hour}.csv",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": "\n"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}

Azure SQL output dataset:

The sample copies data to a table named "MyTable" in Azure SQL Database. Create the table in your SQL database with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.

{
  "name": "AzureSqlOutput",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyOutputTable"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

A copy activity in a pipeline with Blob source and SQL sink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and the sink type is set to SqlSink.

{
  "name":"SamplePipeline",
  "properties":{
    "start":"2014-06-01T18:00:00",
    "end":"2014-06-01T19:00:00",
    "description":"pipeline with copy activity",
    "activities":[
      {
        "name": "AzureBlobtoSQL",
        "description": "Copy Activity",
        "type": "Copy",
        "inputs": [
          {
            "name": "AzureBlobInput"
          }
        ],
        "outputs": [
          {
            "name": "AzureSqlOutput"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "BlobSource"
          },
          "sink": {
            "type": "SqlSink"
          }
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ]
  }
}

JSON Example: Copy data from Azure SQL to Azure Blob

The following sample shows:

  1. A linked service of type AzureSqlDatabase.
  2. A linked service of type AzureStorage.
  3. An input dataset of type AzureSqlTable.
  4. An output dataset of type AzureBlob.
  5. A pipeline with a Copy activity that uses SqlSource and BlobSink.

The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples.

Azure SQL linked service:

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
    }
  }
}

Azure Storage linked service:

{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the first one, you specify the connection string that includes the account key; for the latter, you specify the Shared Access Signature (SAS) URI. See the Linked service properties section for details.

Azure SQL input dataset:

The sample assumes you have created a table "MyTable" in Azure SQL and that it contains a column called "timestampcolumn" for time series data.

Setting "external": "true" informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.

{
  "name": "AzureSqlInput",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyTable"
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}

Azure Blob output dataset:

Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.

{
  "name": "AzureBlobOutput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
      "partitionedBy": [
        {
          "name": "Year",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t",
        "rowDelimiter": "\n"
      }
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

A copy activity in a pipeline with SQL source and Blob sink:

The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and the sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.

{
    "name":"SamplePipeline",
    "properties":{
        "start":"2014-06-01T18:00:00",
        "end":"2014-06-01T19:00:00",
        "description":"pipeline for copy activity",
        "activities":[
            {
                "name": "AzureSQLtoBlob",
                "description": "copy activity",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "AzureSQLInput"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOutput"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

Note

To map columns from the source dataset to columns from the sink dataset, see Mapping dataset columns in Azure Data Factory.

Performance and Tuning

See the Copy Activity Performance & Tuning Guide to learn about the key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and the various ways to optimize it.