How can I get all files inside a drive irrespective of folder structure in ADF?
I want to copy files from a SharePoint drive that has lots of nested folders; the folder hierarchy goes up to 12 levels deep. Currently I'm using the endpoint below in ADF's Web activity, since some articles mentioned that it returns every element inside the drive at any folder level.
https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
I was handling odata.nextLink as well. The pipeline ran fine when I executed it, but after the 5th cycle I noticed the results were almost identical to the previous run: it copied only 600 files out of 50k.
Please suggest an endpoint that can fetch all the files and folders at any folder level.
Azure Data Lake Storage
Azure Data Factory
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-14T06:45:14.24+00:00 Welcome to the Microsoft Q&A platform and thank you for posting your question here.
While there isn't a single endpoint that directly provides this functionality, here are two effective approaches you can combine:
1. Leverage Get Metadata Activity for Folder Structure:
   - Use the Get Metadata activity to retrieve the folder hierarchy within the SharePoint drive. This activity retrieves metadata about folders and subfolders, enabling you to iterate through the structure.
   - Configure the Get Metadata activity with the following settings:
     - Dataset: Point to your SharePoint drive dataset.
     - Folder Path: Set it to the root folder path of your SharePoint drive (e.g., /sites/yoursite/Shared Documents).
     - Recursive: Set to true to include subfolders in the metadata retrieval.
     - List Children: Set to true to obtain details about child folders and files.
2. Utilize a Loop to Process Files Recursively:
   - Employ a For Each activity to iterate over the folder structure retrieved by Get Metadata.
   - Inside the loop, use a Web Activity to call the Microsoft Graph API endpoint you mentioned: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
   - Access the current folder path within the loop using @item().folderPath (replace item() with the appropriate variable name based on your ADF version).
   - Optionally, include additional filtering criteria (e.g., file type) within the $filter parameter.
   - Handle pagination using the odata.nextLink property returned by the API to retrieve all files (see the sketch after this list).
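For reference, a minimal sketch of this pagination pattern outside ADF (hypothetical Python; the drive ID, timestamp, and token below are placeholders, not values from your pipeline):

```python
import requests

# Hypothetical placeholders -- substitute your own drive ID, filter timestamp, and access token.
DRIVE_ID = "<drive-id>"
LAST_UPDATE = "2024-01-01T00:00:00Z"
ACCESS_TOKEN = "<bearer-token>"

def fetch_filtered_items(drive_id, last_update, token):
    """Collect every item returned by the filtered listing, following @odata.nextLink pages."""
    url = (
        f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/"
        f"?$filter=lastModifiedDateTime ge {last_update}&$count=true"
    )
    headers = {"Authorization": f"Bearer {token}"}
    items = []
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        payload = response.json()
        items.extend(payload.get("value", []))
        # Graph returns @odata.nextLink while more pages remain; it is absent on the last page.
        url = payload.get("@odata.nextLink")
    return items

if __name__ == "__main__":
    print(f"Retrieved {len(fetch_filtered_items(DRIVE_ID, LAST_UPDATE, ACCESS_TOKEN))} items")
```

In ADF the same loop is expressed with an Until activity that keeps calling the Web Activity while a nextLink variable is non-empty.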
Additional Considerations:
- Error Handling: Implement error handling mechanisms within the loop to gracefully handle potential issues during file retrieval.
- Performance Optimization: For very large drives, consider techniques like batching file requests or leveraging Azure Data Explorer for efficient data processing.
Hope this helps. Do let us know if you have any further queries.
-
Mansi Yadav 20 Reputation points
2024-05-14T15:31:33.5533333+00:00 Can you help me in creating an iterative approach? Let's say I have a drive and I'm using the endpoint drives/drive-id/root/children. Now I have all the initial folders and files, so I can easily filter out the files and copy them, and if there is an odata.nextLink, I can add an Until activity to deal with it. But my doubt is how to deal with folders. I can use drives/drive-id/items/folder-id/children, which gives me the next level of files and folders, but how do I iterate to the next level and so on? This is my doubt.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-15T06:40:32.54+00:00 Here's how you can create an iterative approach in ADF to copy files from SharePoint with nested folders (up to 12 levels) using the approach outlined previously:
1. Get Initial Folders and Files:
Use a Get Metadata activity with the following settings:
- Dataset: Point to your SharePoint drive dataset.
- Folder Path: Set it to the root folder path (e.g., /sites/yoursite/Shared Documents).
- Recursive: Set to true to include subfolders.
- List Children: Set to true to obtain details about child folders and files.
This retrieves all folders (including nested ones) and files at the root level.
2. Loop and Process Folders:
- Use a For Each activity to iterate through the output of the Get Metadata activity. This loop will process each folder (and its descendants).
- Inside the loop:
  - Identify Folder Type: Check whether the current item (@item()) is a folder using the @item().folder property (present for folders, absent for files).
  - Process Files (if the item is not a folder):
    - Use a Web Activity to call the existing endpoint: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
    - Update the folder path dynamically using @item().folderPath to target the current folder.
    - Use a Copy Data activity to copy the retrieved files to Azure Data Lake Storage.
    - Implement pagination using odata.nextLink returned by the API to ensure all files within the folder are copied.
  - Process Folders (if the item is a folder):
    - Use another Web Activity to call the endpoint for retrieving children: https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/@{item().id}/children/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true
    - This retrieves the child items (folders and files) within the current folder.
    - Use a Set Variable activity to create a new variable (e.g., childItems) and store the output of the Web Activity (folder and file details).
    - Use another For Each activity to iterate through the childItems variable. This inner loop will process the children of the current folder, recursively handling nested structures.
3. Handle Pagination (Optional):
- Within both the file and child-folder processing loops, check for the odata.nextLink property in the Web Activity output.
- If odata.nextLink exists, use a Set Variable activity to update a variable holding the next link.
- Add an Until activity that loops until the next link is empty, i.e., with the condition @empty(variables('nextLink')).
- Inside the Until loop, use another Web Activity to call the odata.nextLink endpoint to retrieve the next set of files/folders and continue processing (a sketch of the whole traversal, outside ADF, follows this list).
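To make the control flow above concrete, here is a minimal breadth-first sketch of the same traversal outside ADF (hypothetical Python; the drive ID and token are placeholders): it starts from the root's children, queues every item that carries the folder facet, collects the rest as files, and follows @odata.nextLink on every page.

```python
import requests
from collections import deque

DRIVE_ID = "<drive-id>"          # hypothetical placeholder
ACCESS_TOKEN = "<bearer-token>"  # hypothetical placeholder
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def list_children(url):
    """Yield every child item behind a /children URL, following @odata.nextLink pages."""
    while url:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        payload = response.json()
        yield from payload.get("value", [])
        url = payload.get("@odata.nextLink")

def collect_files(drive_id):
    """Walk the drive level by level: folders are queued for later, files are collected."""
    files = []
    # Seed the queue with the root folder's children -- the first Web Activity call.
    queue = deque([f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/children"])
    while queue:
        for item in list_children(queue.popleft()):
            if "folder" in item:
                # A folder: queue its own /children endpoint so the next level is processed too.
                queue.append(
                    f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item['id']}/children"
                )
            else:
                files.append(item)
    return files

if __name__ == "__main__":
    print(f"Found {len(collect_files(DRIVE_ID))} files")
```

The inner For Each over childItems plays the role of the queue here; each pass handles one more folder level, which is why the pattern scales to a 12-level hierarchy.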
4. Error Handling:
- Implement error handling within the loops (for example, via activity failure paths, ADF's equivalent of try/catch) to gracefully handle potential issues during file/folder retrieval.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-16T10:57:30.1033333+00:00 @Mansi Yadav We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you have a resolution, please do share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
-
phemanth 6,810 Reputation points • Microsoft Vendor
2024-05-20T09:44:23.4666667+00:00 Certainly, here is a function you can use to normalize filenames for your Azure Data Factory (ADF) pipeline:
```python
def normalize_filename(filename):
    """
    This function removes special characters from a filename and replaces them with underscores.

    Args:
        filename (str): The filename to normalize.

    Returns:
        str: The normalized filename.
    """
    valid_chars = "-_.() {}[]abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    # Keep valid characters as-is and replace anything else with an underscore.
    new_filename = ''.join(c if c in valid_chars else '_' for c in filename)
    return new_filename

# Example usage
filename = "This is a file^&*!@#$%^{}.txt"
normalized_filename = normalize_filename(filename)
print(normalized_filename)  # Prints the filename with the special characters replaced by underscores
```
This function keeps alphanumeric characters plus underscores, hyphens, periods, parentheses, braces, brackets, and spaces, and replaces any other special character with an underscore.
Here's how you can use the normalize_filename function in your ADF copy activity:
- Add a Set Variable activity before your copy activity.
- In the Set Variable activity, set the name of the variable to something like normalizedFilename.
- In the Value field of the Set Variable activity, use the following expression:
  @concat('''', normalize_filename(item().name), '''')
- This expression calls the normalize_filename function with the current item's name (which should contain the filename) and returns the normalized filename. The @concat function adds single quotes around the filename, which is required by ADF.
- In your copy activity, use the @{variables('normalizedFilename')} expression instead of the original filename parameter.
By following these steps, you can ensure that your filenames are properly formatted and will not cause any errors in your ADF copy activity.
| Special Characters | Replaced With |
| --- | --- |
| &, @, ©, ™, etc. | Underscore (_) |