How can I get all files inside a drive irrespective of folder structure in ADF?
I want to copy files from a SharePoint drive that has lots of nested folders; the folder hierarchy goes up to 12 levels deep. Currently I'm using the endpoint below in ADF's Web activity, since some articles mentioned that it returns every element inside the drive at any folder level.
https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
I was handling odata.nextLink as well. The pipeline ran fine when I executed it, but after the 5th cycle I noticed the results were almost identical to the previous run: it copied only 600 files out of 50k.
Please suggest an endpoint that can fetch all the files and folders at any folder level.
Azure Data Lake Storage
Azure Data Factory
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-14T06:45:14.24+00:00 Welcome to the Microsoft Q&A platform and thank you for posting your question here.
While there isn't a single endpoint that directly provides this functionality, here are two effective approaches you can combine:
1. Leverage Get Metadata Activity for Folder Structure:
   - Use the Get Metadata activity to retrieve the folder hierarchy within the SharePoint drive. This activity retrieves metadata about folders and subfolders, enabling you to iterate through the structure.
   - Configure the Get Metadata activity with the following settings:
     - Dataset: Point to your SharePoint drive dataset.
     - Folder Path: Set it to the root folder path of your SharePoint drive (e.g., /sites/yoursite/Shared Documents).
     - Recursive: Set to true to include subfolders in the metadata retrieval.
     - List Children: Set to true to obtain details about child folders and files.
2. Utilize a Loop to Process Files Recursively:
   - Employ a For Each activity to iterate over the folder structure retrieved by Get Metadata.
   - Inside the loop, use a Web Activity to call the Microsoft Graph API endpoint you mentioned: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
   - Access the current folder path within the loop using @item().folderPath (replace item() with the appropriate variable name based on your ADF version).
   - Optionally, include additional filtering criteria (e.g., file type) within the $filter parameter.
   - Handle pagination using the odata.nextLink property returned by the API to retrieve all files (see the sketch after this list).
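For reference, a minimal sketch of this pagination pattern outside ADF (hypothetical Python; the drive ID, timestamp, and token below are placeholders, not values from your pipeline):

```python
import requests

# Hypothetical placeholders -- substitute your own drive ID, filter timestamp, and access token.
DRIVE_ID = "<drive-id>"
LAST_UPDATE = "2024-01-01T00:00:00Z"
ACCESS_TOKEN = "<bearer-token>"

def fetch_filtered_items(drive_id, last_update, token):
    """Collect every item returned by the filtered listing, following @odata.nextLink pages."""
    url = (
        f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/"
        f"?$filter=lastModifiedDateTime ge {last_update}&$count=true"
    )
    headers = {"Authorization": f"Bearer {token}"}
    items = []
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        payload = response.json()
        items.extend(payload.get("value", []))
        # Graph returns @odata.nextLink while more pages remain; it is absent on the last page.
        url = payload.get("@odata.nextLink")
    return items

if __name__ == "__main__":
    print(f"Retrieved {len(fetch_filtered_items(DRIVE_ID, LAST_UPDATE, ACCESS_TOKEN))} items")
```

In ADF the same loop is expressed with an Until activity that keeps calling the Web Activity while a nextLink variable is non-empty.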
Additional Considerations:
- Error Handling: Implement error handling mechanisms within the loop to gracefully handle potential issues during file retrieval.
- Performance Optimization: For very large drives, consider techniques like batching file requests or leveraging Azure Data Explorer for efficient data processing.
Hope this helps. Do let us know if you have any further queries.
-
Mansi Yadav 20 Reputation points
2024-05-14T15:31:33.5533333+00:00 Can you help me in creating an iterative approach? Let's say I have a drive and I'm using the endpoint drives/drive-id/root/children. Now I have all the initial folders and files, so I can easily filter out the files and copy them, and if there is an odata.nextLink, I can add an Until activity to deal with it. But my doubt is how to deal with folders. I can use drives/drive-id/items/folder-id/children, which gives me the next level of files and folders, but how do I iterate to the next level and so on? This is my doubt.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-15T06:40:32.54+00:00 Here's how you can create an iterative approach in ADF to copy files from SharePoint with nested folders (up to 12 levels) using the approach outlined previously:
1. Get Initial Folders and Files:
Use a Get Metadata activity with the following settings:
- Dataset: Point to your SharePoint drive dataset.
- Folder Path: Set it to the root folder path (e.g., /sites/yoursite/Shared Documents).
- Recursive: Set to true to include subfolders.
- List Children: Set to true to obtain details about child folders and files.
This retrieves all folders (including nested ones) and files at the root level.
2. Loop and Process Folders:
- Use a For Each activity to iterate through the output of the Get Metadata activity. This loop will process each folder (and its descendants).
- Inside the loop:
  - Identify Folder Type: Check whether the current item (@item()) is a folder using the @item().folder property (present for folders, absent for files).
  - Process Files (if the item is not a folder):
    - Use a Web Activity to call the existing endpoint: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
    - Update the folder path dynamically using @item().folderPath to target the current folder.
    - Use a Copy Data activity to copy the retrieved files to Azure Data Lake Storage.
    - Implement pagination using odata.nextLink returned by the API to ensure all files within the folder are copied.
  - Process Folders (if the item is a folder):
    - Use another Web Activity to call the endpoint for retrieving children: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/@{item().id}/children/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
    - This retrieves the child items (folders and files) within the current folder.
    - Use a Set Variable activity to create a new variable (e.g., childItems) and store the output of the Web Activity (folder and file details).
    - Use another For Each activity to iterate through the childItems variable. This inner loop will process the children of the current folder, recursively handling nested structures.
3. Handle Pagination (Optional):
- Within both the file and child-folder processing loops, check for the odata.nextLink property in the Web Activity output.
- If odata.nextLink exists, use a Set Variable activity to update a variable holding the next link.
- Add an Until activity that loops until the next link is empty, i.e., with the condition @empty(variables('nextLink')).
- Inside the Until loop, use another Web Activity to call the odata.nextLink endpoint to retrieve the next set of files/folders and continue processing (a sketch of the whole traversal, outside ADF, follows this list).
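To make the control flow above concrete, here is a minimal breadth-first sketch of the same traversal outside ADF (hypothetical Python; the drive ID and token are placeholders): it starts from the root's children, queues every item that carries the folder facet, collects the rest as files, and follows @odata.nextLink on every page.

```python
import requests
from collections import deque

DRIVE_ID = "<drive-id>"          # hypothetical placeholder
ACCESS_TOKEN = "<bearer-token>"  # hypothetical placeholder
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def list_children(url):
    """Yield every child item behind a /children URL, following @odata.nextLink pages."""
    while url:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        payload = response.json()
        yield from payload.get("value", [])
        url = payload.get("@odata.nextLink")

def collect_files(drive_id):
    """Walk the drive level by level: folders are queued for later, files are collected."""
    files = []
    # Seed the queue with the root folder's children -- the first Web Activity call.
    queue = deque([f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/children"])
    while queue:
        for item in list_children(queue.popleft()):
            if "folder" in item:
                # A folder: queue its own /children endpoint so the next level is processed too.
                queue.append(
                    f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item['id']}/children"
                )
            else:
                files.append(item)
    return files

if __name__ == "__main__":
    print(f"Found {len(collect_files(DRIVE_ID))} files")
```

The inner For Each over childItems plays the role of the queue here; each pass handles one more folder level, which is why the pattern scales to a 12-level hierarchy.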
4. Error Handling:
- Implement error handling within the loops (for example, via activity failure paths, ADF's equivalent of try/catch) to gracefully handle potential issues during file/folder retrieval.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-16T10:57:30.1033333+00:00 @Mansi Yadav We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you have a resolution, please do share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-20T09:44:23.4666667+00:00 Certainly, here is a function you can use to normalize filenames for your Azure Data Factory (ADF) pipeline:
```python
def normalize_filename(filename):
    """
    This function removes special characters from a filename and replaces them with underscores.

    Args:
        filename (str): The filename to normalize.

    Returns:
        str: The normalized filename.
    """
    valid_chars = "-_.() {}[]abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    # Keep valid characters as-is and replace anything else with an underscore.
    new_filename = ''.join(c if c in valid_chars else '_' for c in filename)
    return new_filename

# Example usage
filename = "This is a file^&*!@#$%^{}.txt"
normalized_filename = normalize_filename(filename)
print(normalized_filename)  # Prints the filename with the special characters replaced by underscores
```
This function keeps alphanumeric characters plus underscores, hyphens, periods, parentheses, braces, brackets, and spaces, and replaces any other special character with an underscore.
Here's how you can use the normalize_filename function in your ADF copy activity:
- Add a Set Variable activity before your copy activity.
- In the Set Variable activity, set the name of the variable to something like normalizedFilename.
- In the Value field of the Set Variable activity, use the following expression:
  @concat('''', normalize_filename(item().name), '''')
- This expression calls the normalize_filename function with the current item's name (which should contain the filename) and returns the normalized filename. The @concat function adds single quotes around the filename, which is required by ADF.
- In your copy activity, use the @{variables('normalizedFilename')} expression instead of the original filename parameter.
By following these steps, you can ensure that your filenames are properly formatted and will not cause any errors in your ADF copy activity.
| Special Characters | Replaced With |
| --- | --- |
| &, @, ©, ™, etc. | Underscore (_) |