DataLakeFileClient Class

A client to interact with a DataLake file, even if the file may not yet exist.

Inheritance
azure.storage.filedatalake._path_client.PathClient
DataLakeFileClient

Constructor

DataLakeFileClient(account_url, file_system_name, file_path, credential=None, **kwargs)

Parameters

account_url
str
Required

The URI to the storage account.

file_system_name
str
Required

The file system for the directory or files.

file_path
str
Required

The full file path, used to interact with a specific file, e.g. "{directory}/{subdirectory}/{file}".

credential
default value: None

The credentials with which to authenticate. This is optional if the account URL already has a SAS token. The value can be a SAS token string, an instance of AzureSasCredential from azure.core.credentials, an account shared access key, or an instance of a TokenCredential class from azure.identity. If the resource URI already contains a SAS token, it will be ignored in favor of an explicit credential, except in the case of AzureSasCredential, where conflicting SAS tokens will raise a ValueError.

Examples

Creating the DataLakeFileClient from a connection string.


   from azure.storage.filedatalake import DataLakeFileClient
   file_client = DataLakeFileClient.from_connection_string(connection_string, "myfilesystem", "mydirectory/myfile")

Variables

url
str

The full endpoint URL to the file system, including SAS token if used.

primary_endpoint
str

The full primary endpoint URL.

primary_hostname
str

The hostname of the primary endpoint.

Methods

append_data

Append data to the file.

create_file

Create a new file.

delete_file

Marks the specified file for deletion.

download_file

Downloads a file to the StorageStreamDownloader. The readall() method must be used to read all the content, or readinto() must be used to download the file into a stream. Using chunks() returns an iterator which allows the user to iterate over the content in chunks.

exists

Returns True if a file exists and returns False otherwise.

flush_data

Commit the previous appended data.

from_connection_string

Create a DataLakeFileClient from a connection string.

get_file_properties

Returns all user-defined metadata, standard HTTP properties, and system properties for the file. It does not return the content of the file.

query_file

Enables users to select/project on DataLake file data by providing simple query expressions. This operation returns a DataLakeFileQueryReader; use readall() or readinto() to retrieve the query results.

rename_file

Rename the source file.

set_file_expiry

Sets the time a file will expire and be deleted.

upload_data

Upload data to a file.

append_data

Append data to the file.

append_data(data, offset, length=None, **kwargs)

Parameters

data
Required

Content to be appended to the file.

offset
Required

Start position at which the data is to be appended.

length
default value: None

Size of the data in bytes.

validate_content
bool

If true, calculates an MD5 hash of the block content. The storage service checks the hash of the content that has arrived against the hash that was sent. This is primarily valuable for detecting bitflips on the wire if using http instead of https, as https (the default) will already validate. Note that this MD5 hash is not stored with the file.

lease
DataLakeLeaseClient or str

Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string.

Returns

dict of the response headers

Examples

Append data to the file.


   file_client.append_data(data=file_content[2048:3072], offset=2048, length=1024)

create_file

Create a new file.

create_file(content_settings=None, metadata=None, **kwargs)

Parameters

content_settings
ContentSettings
default value: None

ContentSettings object used to set path properties.

metadata
dict(str, str)
default value: None

Name-value pairs associated with the file as metadata.

lease
DataLakeLeaseClient or str

Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string.

umask
str

Optional and only valid if Hierarchical Namespace is enabled for the account. When creating a file or directory and the parent folder does not have a default ACL, the umask restricts the permissions of the file or directory to be created. The resulting permission is given by p & ^u, where p is the permission and u is the umask. For example, if p is 0777 and u is 0057, then the resulting permission is 0720. The default permission is 0777 for a directory and 0666 for a file. The default umask is 0027. The umask must be specified in 4-digit octal notation (e.g. 0766).
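
The resulting-permission rule above ("p & ^u", i.e. p AND NOT u) can be checked locally. A quick sketch of the arithmetic, independent of the service, using the documented example values:

```python
# Resulting permission = p & ~u (the docs write this as "p & ^u").
# Values below are the example from the umask description.
p = 0o777          # requested permission (the directory default)
u = 0o057          # umask
result = p & ~u    # bits set in the umask are stripped from p
print(oct(result)) # 0o720
```

This is plain two's-complement masking, so the same expression works in any language with bitwise operators.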

permissions
str

Optional and only valid if Hierarchical Namespace is enabled for the account. Sets POSIX access permissions for the file owner, the file owning group, and others. Each class may be granted read, write, or execute permission. The sticky bit is also supported. Both symbolic (rwxrw-rw-) and 4-digit octal notation (e.g. 0766) are supported.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

Returns

response dict (Etag and last modified).

Examples

Create file.


   file_client = filesystem_client.get_file_client(file_name)
   file_client.create_file()

delete_file

Marks the specified file for deletion.

delete_file(**kwargs)

Parameters

lease
DataLakeLeaseClient or str

Required if the file has an active lease. Value can be a LeaseClient object or the lease ID as a string.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

Returns

None

Examples

Delete file.


   new_client.delete_file()

download_file

Downloads a file to the StorageStreamDownloader. The readall() method must be used to read all the content, or readinto() must be used to download the file into a stream. Using chunks() returns an iterator which allows the user to iterate over the content in chunks.

download_file(offset=None, length=None, **kwargs)

Parameters

offset
int
default value: None

Start of byte range to use for downloading a section of the file. Must be set if length is provided.

length
int
default value: None

Number of bytes to read from the stream. This is optional, but should be supplied for optimal performance.

lease
DataLakeLeaseClient or str

If specified, download only succeeds if the file's lease is active and matches this ID. Required if the file has an active lease.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

max_concurrency
int

The number of parallel connections with which to download.

timeout
int

The timeout parameter is expressed in seconds. This method may make multiple calls to the Azure service and the timeout will apply to each call individually.

Returns

A streaming object (StorageStreamDownloader)

Return type

<xref:azure.storage.filedatalake.StorageStreamDownloader>

Examples

Return the downloaded data.


   download = file_client.download_file()
   downloaded_bytes = download.readall()

exists

Returns True if a file exists and returns False otherwise.

exists(**kwargs)

Parameters

timeout
int

The timeout parameter is expressed in seconds.

Returns

boolean

flush_data

Commit the previous appended data.

flush_data(offset, retain_uncommitted_data=False, **kwargs)

Parameters

offset
Required

The offset is equal to the length of the file after committing the previously appended data.

retain_uncommitted_data
bool
default value: False

Valid only for flush operations. If "true", uncommitted data is retained after the flush operation completes; otherwise, the uncommitted data is deleted after the flush operation. The default is false. Data at offsets less than the specified position are written to the file when flush succeeds, but this optional parameter allows data after the flush position to be retained for a future flush operation.

content_settings
ContentSettings

ContentSettings object used to set path properties.

close
bool

Azure Storage Events allow applications to receive notifications when files change. When Azure Storage Events are enabled, a file changed event is raised. This event has a property indicating whether this is the final change to distinguish the difference between an intermediate flush to a file stream and the final close of a file stream. The close query parameter is valid only when the action is "flush" and change notifications are enabled. If the value of close is "true" and the flush operation completes successfully, the service raises a file change notification with a property indicating that this is the final update (the file stream has been closed). If "false" a change notification is raised indicating the file has changed. The default is false. This query parameter is set to true by the Hadoop ABFS driver to indicate that the file stream has been closed.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

Returns

dict of the response headers

Examples

Commit the previous appended data.


   with open(SOURCE_FILE, "rb") as data:
       file_client = file_system_client.get_file_client("myfile")
       file_client.create_file()
       file_client.append_data(data, 0)
       file_client.flush_data(data.tell())
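
The offset bookkeeping for append_data and flush_data can be checked without the service: each append's offset is the running byte count so far, and the final flush offset equals the total length. A sketch with a hypothetical helper (no service calls):

```python
def plan_appends(chunks):
    """Return (offset, length) pairs for each chunk plus the final flush offset."""
    plan, offset = [], 0
    for chunk in chunks:
        plan.append((offset, len(chunk)))  # append_data(chunk, offset, len(chunk))
        offset += len(chunk)
    return plan, offset                    # flush_data(offset) commits everything

plan, flush_offset = plan_appends([b"a" * 1024, b"b" * 1024, b"c" * 1024])
print(plan)          # [(0, 1024), (1024, 1024), (2048, 1024)]
print(flush_offset)  # 3072
```

This mirrors why the sample above passes data.tell() to flush_data: after sequential appends, the stream position equals the total committed length.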

from_connection_string

Create a DataLakeFileClient from a connection string.

from_connection_string(conn_str, file_system_name, file_path, credential=None, **kwargs)

Parameters

conn_str
str
Required

A connection string to an Azure Storage account.

file_system_name
str
Required

The name of file system to interact with.

file_path
str
Required

The full file path, used to interact with a specific file, e.g. "{directory}/{subdirectory}/{file}".

credential
default value: None

The credentials with which to authenticate. This is optional if the account URL already has a SAS token, or if the connection string already has shared access key values. The value can be a SAS token string, an instance of AzureSasCredential from azure.core.credentials, an account shared access key, or an instance of a TokenCredential class from azure.identity. Credentials provided here will take precedence over those in the connection string.

get_file_properties

Returns all user-defined metadata, standard HTTP properties, and system properties for the file. It does not return the content of the file.

get_file_properties(**kwargs)

Parameters

lease

Required if the directory or file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

Return type

<xref:FileProperties>

Examples

Getting the properties for a file.


   properties = file_client.get_file_properties()

query_file

Enables users to select/project on DataLake file data by providing simple query expressions. This operation returns a DataLakeFileQueryReader; use readall() or readinto() to retrieve the query results.

query_file(query_expression, **kwargs)

Parameters

query_expression
str
Required

Required. A query statement, e.g. "SELECT * from DataLakeStorage".

on_error
<xref:Callable>[<xref:azure.storage.filedatalake.DataLakeFileQueryError>]

A function to be called on any processing errors returned by the service.

file_format
DelimitedTextDialect or DelimitedJsonDialect

Optional. Defines the serialization of the data currently stored in the file. The default is to treat the file data as CSV data formatted in the default dialect. This can be overridden with a custom DelimitedTextDialect, or alternatively a DelimitedJsonDialect.

output_format
DelimitedTextDialect, DelimitedJsonDialect or list[ArrowDialect]

Optional. Defines the output serialization for the data stream. By default the data will be returned as it is represented in the file. By providing an output format, the file data will be reformatted according to that profile. This value can be a DelimitedTextDialect or a DelimitedJsonDialect.

lease
DataLakeLeaseClient or str

Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

Returns

A streaming object (DataLakeFileQueryReader)

Return type

<xref:azure.storage.filedatalake.DataLakeFileQueryReader>

Examples

Select/project on DataLake file data by providing simple query expressions.


   errors = []
   def on_error(error):
       errors.append(error)

   # upload the csv file
   file_client = datalake_service_client.get_file_client(filesystem_name, "csvfile")
   file_client.upload_data(CSV_DATA, overwrite=True)

   # select the second column of the csv file
   query_expression = "SELECT _2 from DataLakeStorage"
   input_format = DelimitedTextDialect(delimiter=',', quotechar='"', lineterminator='\n', escapechar="", has_header=False)
   output_format = DelimitedJsonDialect(delimiter='\n')
   reader = file_client.query_file(query_expression, on_error=on_error, file_format=input_format, output_format=output_format)
   content = reader.readall()

rename_file

Rename the source file.

rename_file(new_name, **kwargs)

Parameters

new_name
str
Required

The new file name to rename to. The value must have the following format: "{filesystem}/{directory}/{subdirectory}/{file}".

content_settings
ContentSettings

ContentSettings object used to set path properties.

source_lease
DataLakeLeaseClient or str

A lease ID for the source path. If specified, the source path must have an active lease and the lease ID must match.

lease

Required if the file/directory has an active lease. Value can be a LeaseClient object or the lease ID as a string.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

source_if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

source_if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

source_etag
str

The source ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

source_match_condition
MatchConditions

The source match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

Returns

the renamed file client

Return type

<xref:DataLakeFileClient>

Examples

Rename the source file.


   new_client = file_client.rename_file(file_client.file_system_name + '/' + 'newname')

set_file_expiry

Sets the time a file will expire and be deleted.

set_file_expiry(expiry_options, expires_on=None, **kwargs)

Parameters

expiry_options
str
Required

Required. Indicates the mode of the expiry time. Possible values include: 'NeverExpire', 'RelativeToCreation', 'RelativeToNow', 'Absolute'

expires_on
datetime or int
default value: None

The time to set the file to expire. When expiry_options is RelativeTo*, expires_on should be an int in milliseconds. If expires_on is a datetime, it should be in UTC.

timeout
int

The timeout parameter is expressed in seconds.
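
For the RelativeTo* modes, expires_on is an integer number of milliseconds. A small sketch converting a timedelta to that value (the helper name is illustrative, not part of the SDK):

```python
from datetime import timedelta

def to_expiry_ms(delta: timedelta) -> int:
    """Convert a timedelta to the integer milliseconds expected by RelativeTo* modes."""
    return int(delta.total_seconds() * 1000)

# e.g. file_client.set_file_expiry('RelativeToNow', expires_on=to_expiry_ms(timedelta(hours=1)))
print(to_expiry_ms(timedelta(hours=1)))  # 3600000
```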

Return type

None

upload_data

Upload data to a file.

upload_data(data, length=None, overwrite=False, **kwargs)

Parameters

data
Required

Content to be uploaded to the file.

length
int
default value: None

Size of the data in bytes.

overwrite
bool
default value: False

Whether to overwrite an existing file.

content_settings
ContentSettings

ContentSettings object used to set path properties.

metadata
dict(str, str)

Name-value pairs associated with the file as metadata.

lease
DataLakeLeaseClient or str

Required if the file has an active lease. Value can be a DataLakeLeaseClient object or the lease ID as a string.

umask
str

Optional and only valid if Hierarchical Namespace is enabled for the account. When creating a file or directory and the parent folder does not have a default ACL, the umask restricts the permissions of the file or directory to be created. The resulting permission is given by p & ^u, where p is the permission and u is the umask. For example, if p is 0777 and u is 0057, then the resulting permission is 0720. The default permission is 0777 for a directory and 0666 for a file. The default umask is 0027. The umask must be specified in 4-digit octal notation (e.g. 0766).

permissions
str

Optional and only valid if Hierarchical Namespace is enabled for the account. Sets POSIX access permissions for the file owner, the file owning group, and others. Each class may be granted read, write, or execute permission. The sticky bit is also supported. Both symbolic (rwxrw-rw-) and 4-digit octal notation (e.g. 0766) are supported.

if_modified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has been modified since the specified time.

if_unmodified_since
datetime

A DateTime value. Azure expects the date value passed in to be UTC. If timezone is included, any non-UTC datetimes will be converted to UTC. If a date is passed in without timezone info, it is assumed to be UTC. Specify this header to perform the operation only if the resource has not been modified since the specified date/time.

validate_content
bool

If true, calculates an MD5 hash for each chunk of the file. The storage service checks the hash of the content that has arrived against the hash that was sent. This is primarily valuable for detecting bitflips on the wire if using http instead of https, as https (the default) will already validate. Note that this MD5 hash is not stored with the file. Also note that if enabled, the memory-efficient upload algorithm will not be used, because computing the MD5 hash requires buffering entire blocks, and doing so defeats the purpose of the memory-efficient algorithm.

etag
str

An ETag value, or the wildcard character (*). Used to check if the resource has changed, and act according to the condition specified by the match_condition parameter.

match_condition
MatchConditions

The match condition to use upon the etag.

timeout
int

The timeout parameter is expressed in seconds.

chunk_size
int

The maximum chunk size for uploading a file in chunks. Defaults to 100*1024*1024, or 100MB.

Returns

response dict (Etag and last modified).
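
The chunk_size parameter bounds how much data each service call carries, so the number of append calls for a chunked upload is roughly ceil(length / chunk_size). A sketch of that arithmetic (the helper is illustrative, not an SDK function):

```python
import math

DEFAULT_CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB, the documented chunk_size default

def estimated_chunks(length: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Estimate how many chunks a file of `length` bytes is split into for upload."""
    return math.ceil(length / chunk_size)

print(estimated_chunks(250 * 1024 * 1024))  # 3
```

Lowering chunk_size trades fewer bytes per request for more round trips; the service-side behavior may add calls (e.g. the final flush), so treat this as an estimate.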