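FileDataset Class

Reference

Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until the FileDataset is asked to deliver data.

A FileDataset is created using the from_files method of the FileDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

Initialize the FileDataset object.

This constructor is not supposed to be invoked directly. Datasets are intended to be created using the FileDatasetFactory class.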

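Inheritance: AbstractDataset

FileDataset

Constructor

FileDataset()

Remarks

A FileDataset can be used as the input of an experiment run. It can also be registered to a workspace with a specified name and retrieved by that name later.

A FileDataset can be subsetted by invoking the different subsetting methods available on this class. The result of subsetting is always a new FileDataset.

The actual data loading happens when the FileDataset is asked to deliver the data into another storage mechanism (for example, files downloaded or mounted to a local path).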
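The lazy, immutable behavior described in these remarks can be illustrated with a small pure-Python sketch. It does not use the Azure ML SDK: `LazyFileSet`, `where`, and `deliver` are hypothetical names that only model the idea that each subsetting call returns a new object and that nothing is read until delivery is requested.

```python
from typing import Callable, Iterable, List, Tuple


class LazyFileSet:
    """Hypothetical stand-in for a lazily evaluated, immutable file dataset."""

    def __init__(self, source: Callable[[], Iterable[str]],
                 steps: Tuple[Callable[[str], bool], ...] = ()):
        self._source = source  # called only when data is actually requested
        self._steps = steps    # recorded predicates, never mutated in place

    def where(self, predicate: Callable[[str], bool]) -> "LazyFileSet":
        # Return a NEW object with one more recorded step; self is unchanged.
        return LazyFileSet(self._source, self._steps + (predicate,))

    def deliver(self) -> List[str]:
        # Only here is the source read and the recorded steps applied.
        return [f for f in self._source() if all(p(f) for p in self._steps)]


load_calls = []


def list_files() -> List[str]:
    load_calls.append(1)  # track how often the source is read
    return ["cat/1.jpg", "cat/notes.txt", "dog/2.jpg"]


full = LazyFileSet(list_files)
jpgs = full.where(lambda path: path.endswith(".jpg"))  # nothing loaded yet
calls_before_delivery = len(load_calls)
delivered = jpgs.deliver()  # loading happens now
```

In this model `full` still delivers all three paths after `jpgs` is created, which mirrors the statement that subsetting always produces a new dataset rather than modifying the original.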

Methods

as_cache	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.
as_download	Create a DatasetConsumptionConfig with the mode set to download. In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and from the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.
as_hdfs	Set the mode to hdfs. In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and from the OS environment variables.
as_mount	Create a DatasetConsumptionConfig with the mode set to mount. In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and from the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.
download	Download file streams defined by the dataset as local files.
file_metadata	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Get a file metadata expression by specifying the metadata column name. Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension, and CanSeek.
filter	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Filter the data, leaving only the records that match the specified expression.
hydrate	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Hydrate the dataset into the requested replicas specified in datacache_store.
mount	Create a context manager for mounting file streams defined by the dataset as local files.
random_split	Split file streams in the dataset into two parts randomly and approximately by the percentage specified. The first dataset returned contains approximately `percentage` of the total number of file references, and the second dataset contains the remaining file references.
skip	Skip file streams from the top of the dataset by the specified count.
take	Take a sample of file streams from the top of the dataset by the specified count.
take_sample	Take a random sample of file streams in the dataset approximately by the probability specified.
to_path	Get a list of file paths for each file stream defined by the dataset.

as_cache

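Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.

as_cache(datacache_store)

Parameters

datacache_store: DatacacheStore

Required

The datacachestore to be used to hydrate.

Returns

The configuration object describing how the datacache should be materialized in the run.

Return type

DatacacheConsumptionConfig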

as_download

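Create a DatasetConsumptionConfig with the mode set to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and from the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.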


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_download()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The download location can be retrieved from argument values
   import sys
   download_location = sys.argv[1]

   # The download location can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   download_location = Run.get_context().input_datasets['input_1']

as_download(path_on_compute=None)

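Parameters

path_on_compute: str

Default value: None

The target path on the compute to make the data available at.

Remarks

When the dataset is created from the path of a single file, the download location will be the path of the single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it is treated as an absolute path. If it does not start with a /, it is treated as a path relative to the working directory. If you specify an absolute path, make sure the job has permission to write to that directory.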
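The absolute-versus-relative rule for path_on_compute can be sketched in plain Python as follows. This is only an illustration of the rule as stated, not the SDK's implementation: `resolve_path_on_compute` is a hypothetical helper, and the working directory `/mnt/job/wd` is an assumed example value.

```python
import posixpath


def resolve_path_on_compute(path_on_compute: str, working_dir: str) -> str:
    """Hypothetical helper modelling the rule: a leading '/' means absolute,
    anything else is taken relative to the working directory."""
    if path_on_compute.startswith("/"):
        return posixpath.normpath(path_on_compute)
    return posixpath.normpath(posixpath.join(working_dir, path_on_compute))


# '/mnt/job/wd' is an assumed working directory used only for illustration.
absolute = resolve_path_on_compute("/tmp/data", "/mnt/job/wd")
relative = resolve_path_on_compute("inputs/images", "/mnt/job/wd")
print(absolute)  # /tmp/data
print(relative)  # /mnt/job/wd/inputs/images
```

The sketch uses posixpath rather than os.path so that the result is the same regardless of the operating system it is run on.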

as_hdfs

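Set the mode to hdfs.

In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and from the OS environment variables.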


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_hdfs()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The hdfs path can be retrieved from argument values
   import sys
   hdfs_path = sys.argv[1]

   # The hdfs path can also be retrieved from the OS environment variables.
   import os
   hdfs_path = os.environ['input_<hash>']

as_hdfs()

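Remarks

When the dataset is created from the path of a single file, the hdfs path will be the path of the single file. Otherwise, the hdfs path will be the path of the enclosing folder for all the mounted files.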

as_mount

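Create a DatasetConsumptionConfig with the mode set to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and from the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.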


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_mount()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The mount point can be retrieved from argument values
   import sys
   mount_point = sys.argv[1]

   # The mount point can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   mount_point = Run.get_context().input_datasets['input_1']

as_mount(path_on_compute=None)

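Parameters

path_on_compute: str

Default value: None

The target path on the compute to make the data available at.

Remarks

When the dataset is created from the path of a single file, the mount point will be the path of the single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.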

download

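Download file streams defined by the dataset as local files.

download(target_path=None, overwrite=False, ignore_not_found=False)

Parameters

target_path: str

Default value: None

The local directory to download the files to. If None, the data will be downloaded into a temporary directory.

overwrite: bool

Default value: False

Indicates whether to overwrite existing files. If overwrite is set to True, existing files are overwritten; otherwise an exception is raised.

ignore_not_found: bool

Default value: False

Indicates whether to fail the download if some files pointed to by the dataset are not found. If ignore_not_found is set to False, the download fails if any file download fails for any reason. Otherwise, a warning is logged for not-found errors and the download succeeds as long as no other error types are encountered.

Returns

Returns an array of file paths for each file downloaded.

Return type

list(str)

Remarks

If target_path starts with a /, it is treated as an absolute path. If it does not start with a /, it is treated as a path relative to the current working directory.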

file_metadata

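Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Get a file metadata expression by specifying the metadata column name.

Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension, and CanSeek.

file_metadata(col)

Parameters

col: str

Required

Name of the column.

Returns

Returns an expression that retrieves the value in the specified column.

Return type

azureml.dataprep.api.expression.RecordFieldExpression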

filter

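Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Filter the data, leaving only the records that match the specified expression.

filter(expression)

Parameters

expression: azureml.dataprep.api.expression.Expression

Required

The expression to evaluate.

Returns

The modified dataset (unregistered).

Return type

FileDataset

Remarks

Expressions are started by indexing the dataset with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression is lazily evaluated for each record when a data pull occurs, not where it is defined.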


   (dataset.file_metadata('Size') > 10000) & (dataset.file_metadata('CanSeek') == True)
   dataset.file_metadata('Extension').starts_with('j')
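To make the idea of combinable, lazily evaluated expressions more concrete, here is a minimal pure-Python model. It is not the azureml.dataprep expression API: `Expr`, `column`, and the sample records (including the assumption that Extension values carry no leading dot) are hypothetical and exist only to show how operators can build a deferred predicate that is evaluated per record later.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class Expr:
    """Hypothetical deferred expression: stores a function, evaluates it later."""

    def __init__(self, fn: Callable[[Record], Any]):
        self._fn = fn

    def evaluate(self, record: Record) -> Any:
        return self._fn(record)

    def __gt__(self, value: Any) -> "Expr":
        return Expr(lambda r: self._fn(r) > value)

    def __eq__(self, value: Any) -> "Expr":  # type: ignore[override]
        return Expr(lambda r: self._fn(r) == value)

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(lambda r: bool(self._fn(r)) and bool(other.evaluate(r)))

    def starts_with(self, prefix: str) -> "Expr":
        return Expr(lambda r: str(self._fn(r)).startswith(prefix))


def column(name: str) -> Expr:
    # Plays the role of indexing the dataset by a metadata column name.
    return Expr(lambda r: r[name])


# Made-up metadata records used only for this illustration.
records: List[Record] = [
    {"Path": "a.jpg", "Size": 20000, "CanSeek": True, "Extension": "jpg"},
    {"Path": "b.png", "Size": 500, "CanSeek": True, "Extension": "png"},
    {"Path": "c.jpeg", "Size": 15000, "CanSeek": False, "Extension": "jpeg"},
]

big_and_seekable = (column("Size") > 10000) & (column("CanSeek") == True)
starts_with_j = column("Extension").starts_with("j")

# Building the expressions evaluated nothing; evaluation happens per record here.
kept_big = [r["Path"] for r in records if big_and_seekable.evaluate(r)]
kept_j = [r["Path"] for r in records if starts_with_j.evaluate(r)]
```

The parentheses around each comparison matter in this model for the same reason they appear in the expressions above: `&` binds more tightly than `>` and `==` in Python.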

hydrate

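Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Hydrate the dataset into the requested replicas specified in datacache_store.

hydrate(datacache_store, replica_count=None)

Parameters

datacache_store: DatacacheStore

Required

The datacachestore to be used to hydrate.

replica_count: int, optional

Default value: None

Number of replicas to hydrate.

Returns

The configuration object describing how the datacache should be materialized in the run.

Return type

DatacacheHydrationTracker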

mount

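Create a context manager for mounting file streams defined by the dataset as local files.

mount(mount_point=None, **kwargs)

Parameters

mount_point: str

Default value: None

The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find by calling the MountContext.mount_point instance method.

Returns

Returns a context manager for managing the lifecycle of the mount.

Return type

MountContext: the context manager. Upon entering the context manager, the dataflow will be mounted to the mount_point. Upon exit, it will remove the mount point and clean up the daemon process used to mount the dataflow.

Remarks

A context manager is returned to manage the lifecycle of the mount. To mount, enter the context manager; to unmount, exit from the context manager.

Mount is only supported on Unix or Unix-like operating systems with the native package libfuse installed. If you are running inside a Docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse.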


   import os
   from azureml.core import Dataset, Datastore

   # workspace is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))

   with dataset.mount() as mount_context:
       # list top level mounted files and folders in the dataset
       os.listdir(mount_context.mount_point)

   # You can also use the start and stop methods
   mount_context = dataset.mount()
   mount_context.start()  # this will mount the file streams
   mount_context.stop()  # this will unmount the file streams

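If mount_point starts with a /, it is treated as an absolute path. If it does not start with a /, it is treated as a path relative to the current working directory.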

random_split

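Split file streams in the dataset into two parts randomly and approximately by the percentage specified.

The first dataset returned contains approximately percentage of the total number of file references, and the second dataset contains the remaining file references.

random_split(percentage, seed=None)

Parameters

percentage: float

Required

The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.

seed: int

Default value: None

An optional seed to use for the random generator.

Returns

Returns a tuple of new FileDataset objects representing the two datasets after the split.

Return type

(FileDataset, FileDataset)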
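One way to picture why the split is only approximate is the following pure-Python model. It assumes that each file reference is assigned to the first part independently with probability `percentage`; that per-file assignment is an assumption made for illustration, not a documented detail of the SDK, and `random_split_sketch` is a hypothetical name.

```python
import random
from typing import List, Optional, Tuple


def random_split_sketch(files: List[str], percentage: float,
                        seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Hypothetical model: each file independently goes to the first part
    with probability `percentage`, so the part sizes are only approximate."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    rng = random.Random(seed)
    first: List[str] = []
    second: List[str] = []
    for path in files:
        (first if rng.random() < percentage else second).append(path)
    return first, second


files = [f"img_{i:04d}.jpg" for i in range(1000)]
train, test = random_split_sketch(files, 0.8, seed=42)
again, _ = random_split_sketch(files, 0.8, seed=42)  # same seed, same split
```

In this model the two parts never overlap and together cover every file, and passing the same seed reproduces the same split, which is the usual reason for supplying one.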

skip

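Skip file streams from the top of the dataset by the specified count.

skip(count)

Parameters

count: int

Required

The number of file streams to skip.

Returns

Returns a new FileDataset object representing a dataset with file streams skipped.

Return type

FileDataset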

take

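Take a sample of file streams from the top of the dataset by the specified count.

take(count)

Parameters

count: int

Required

The number of file streams to take.

Returns

Returns a new FileDataset object representing the sampled dataset.

Return type

FileDataset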
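Because take keeps entries from the top and skip drops them, the two methods are complementary for the same count. The sketch below models that relationship on a plain list of paths; `take_sketch` and `skip_sketch` are hypothetical helpers, not SDK functions.

```python
from itertools import islice
from typing import Iterable, List


def take_sketch(files: Iterable[str], count: int) -> List[str]:
    """Keep the first `count` entries (models take)."""
    return list(islice(files, count))


def skip_sketch(files: Iterable[str], count: int) -> List[str]:
    """Drop the first `count` entries and keep the rest (models skip)."""
    return list(islice(files, count, None))


files = ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"]
head = take_sketch(files, 2)                  # first two entries
tail = skip_sketch(files, 2)                  # everything after the first two
page = take_sketch(skip_sketch(files, 2), 2)  # entries three and four
```

Chaining the two, as in `page`, gives a simple way to walk through a dataset in fixed-size chunks.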

take_sample

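Take a random sample of file streams in the dataset approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

probability: float

Required

The probability of a file stream being included in the sample.

seed: int

Default value: None

An optional seed to use for the random generator.

Returns

Returns a new FileDataset object representing the sampled dataset.

Return type

FileDataset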
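Since probability applies to each file stream, the size of the sample varies from run to run unless a seed is fixed. A minimal model of that behavior, assuming independent per-file inclusion (an assumption for illustration) and using the hypothetical helper `take_sample_sketch`:

```python
import random
from typing import List, Optional


def take_sample_sketch(files: List[str], probability: float,
                       seed: Optional[int] = None) -> List[str]:
    """Hypothetical model: include each file independently with `probability`."""
    rng = random.Random(seed)
    return [path for path in files if rng.random() < probability]


files = [f"doc_{i}.txt" for i in range(2000)]
sample_a = take_sample_sketch(files, 0.1, seed=7)
sample_b = take_sample_sketch(files, 0.1, seed=7)  # same seed, same sample
none_kept = take_sample_sketch(files, 0.0, seed=7)
all_kept = take_sample_sketch(files, 1.0, seed=7)
```

In this model a probability of 0.0 keeps nothing and 1.0 keeps everything, while values in between yield a sample whose size is close to, but not exactly, probability times the number of files.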

to_path

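Get a list of file paths for each file stream defined by the dataset.

to_path()

Returns

Returns an array of file paths.

Return type

list(str)

Remarks

The file paths are relative paths for local files when the file streams are downloaded or mounted.

A common prefix is removed from the file paths based on how the data source was specified to create the dataset. For example: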


   from azureml.core import Dataset, Datastore

   # workspace is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))
   print(dataset.to_path())

   # ['year-2018/1.jpg',
   #  'year-2018/2.jpg',
   #  'year-2019/1.jpg']

   dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/green-small/*.csv')

   print(dataset.to_path())
   # ['/green_tripdata_2013-08.csv']
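The prefix removal shown in the first example can be approximated in plain Python by dropping the literal folder part of the glob pattern, that is, every path segment before the first one that contains a wildcard. This is a rough model under that assumption: `strip_common_prefix` is a hypothetical helper, it only covers glob patterns, and it does not reproduce the leading slash seen in the URL example.

```python
from typing import List


def strip_common_prefix(pattern: str, matched_paths: List[str]) -> List[str]:
    """Hypothetical model: remove the literal folder part of `pattern`
    (all segments before the first segment containing a wildcard)."""
    literal_segments: List[str] = []
    for segment in pattern.split("/"):
        if any(ch in segment for ch in "*?["):
            break
        literal_segments.append(segment)
    prefix = "/".join(literal_segments)
    if prefix:
        prefix += "/"
    # Paths that do not start with the prefix are returned unchanged.
    return [p[len(prefix):] if p.startswith(prefix) else p for p in matched_paths]


pattern = "animals/dog/year-*/*.jpg"
matched = [
    "animals/dog/year-2018/1.jpg",
    "animals/dog/year-2018/2.jpg",
    "animals/dog/year-2019/1.jpg",
]
relative_paths = strip_common_prefix(pattern, matched)
print(relative_paths)  # ['year-2018/1.jpg', 'year-2018/2.jpg', 'year-2019/1.jpg']
```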
