Data Factory ~ Get latest file in container

kryten68 31 Reputation points
2021-03-06T10:50:15.177+00:00

I think this question may have come up before, but I have not been able to locate a clear or clean answer that properly addresses the situation. Given how common this operation must be, I can't help but feel there must be a good way to do this without layers of complexity.

Scenario:
I have a storage account container into which an application is sending data files at varying rates. If the application is experiencing demand, it may push a data file into that container once a minute. If the application is not in demand, the data files may be written one every couple of minutes. In either case, the file metadata is the source of truth: the latest file, the one I need to process, is the one with the most recent last-modified time.

The patterns I see for getting the latest file all require a Get Metadata activity to fetch the childItems of the container, then a ForEach which runs another Get Metadata on each file in turn; at that level you can get at the lastModified property. It is also possible to filter using the 'Filter by last modified' 'Start time' and 'End time' settings.

This pattern does not work for me because, if the application is busy and I am using that filter to return files from, say, the last two minutes, I may well get back two or even three files. I need that two-minute window because, if the application has not been busy, I still need to catch the single last file (which might have been written two minutes ago).

I had hoped the Filter activity would provide a sort function, so I could at least sort the files returned from the two-minute window and then select or filter in the latest one, but that does not seem to be possible.
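For what it's worth, the logic I am after is trivial outside of Data Factory. A quick Python sketch of "pick the newest by last modified" (the file names and timestamps here are made up for illustration):

```python
from datetime import datetime, timezone

# Hypothetical (name, last_modified) pairs, as a Get Metadata + ForEach
# run might collect them from the container.
files = [
    ("data_001.csv", datetime(2021, 3, 6, 10, 47, tzinfo=timezone.utc)),
    ("data_002.csv", datetime(2021, 3, 6, 10, 49, tzinfo=timezone.utc)),
    ("data_003.csv", datetime(2021, 3, 6, 10, 48, tzinfo=timezone.utc)),
]

# The single newest file is just the max by last-modified timestamp.
newest_name, newest_time = max(files, key=lambda f: f[1])
print(newest_name)  # data_002.csv
```

That one `max` call is all the "sorting" I need; I simply cannot find anywhere to express it in the Filter activity.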

Fundamentally, all I am looking for is a solid pattern for getting the single newest file in that container - surely that is a common use case?

At the moment, I am thinking it may be necessary for the pipeline to call a function app which does the work of locating the file I need and then renaming it. That way, all Data Factory would need is a dataset pointing to that fixed filename - but I can't help but feel that accomplishing something this simple should be doable in Data Factory itself, without having to make a function app just for that.
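To make the idea concrete, here is a minimal sketch of what that function app might do, assuming the azure-storage-blob Python SDK (the connection string and container name would come from configuration; this is just an illustration, not a working implementation). The pure "pick newest" helper is kept separate from the SDK call:

```python
def newest_blob(blobs):
    """Return the item with the most recent last_modified, or None if empty."""
    return max(blobs, key=lambda b: b.last_modified, default=None)

def find_latest(conn_str: str, container: str) -> str:
    """List every blob in the container and return the newest blob's name."""
    # Imported here so the pure helper above works without the SDK installed.
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(conn_str, container)
    latest = newest_blob(list(client.list_blobs()))
    if latest is None:
        raise RuntimeError(f"no blobs found in container '{container}'")
    return latest.name
```

The function app could then copy or rename that blob to a fixed name for the Data Factory dataset to point at.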

I would be very grateful for any suggestions on how this can be accomplished.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
    MartinJaffer-MSFT 26,036 Reputation points
    2021-03-09T01:00:14.683+00:00

    Hello @kryten68 and welcome to Microsoft Q&A.

    The short answer is, right now (2021-03-08), there is not a feature to do this easily.

    Variations on "get the most recent file" are a common ask. You may have come across some of my solutions in your research. The definition of "the most recent file" changes from ask to ask. Some define it by a datetime in the filename. Some define it by directories named like a date. Some define it as the most recently created, and others, like yourself, define it as the most recently updated. All of these, multiplied by the myriad of datastores, make it non-trivial.

    Still, I feel this would be worth a feature to make it easier. I will inquire whether there are plans for a feature to ease your use case. To help encourage development of such a feature, please upvote it in the feedback forum.

    The longer answer is a different approach. Use Event Grid so that whenever a new file is written or updated, its filename is written to another file in a fixed location. Data Factory can then use a Lookup activity on that file to get the name of the most recent file.
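    To sketch the Event Grid side (as an Azure Function in Python; the "meta" container and "latest.txt" pointer blob are placeholder names I made up): each BlobCreated event carries a subject of the form /blobServices/default/containers/<container>/blobs/<path>, so the handler only needs to extract that path and overwrite one fixed pointer file, which your Lookup activity then reads.

```python
def blob_path_from_subject(subject: str) -> str:
    """Extract the blob path from an Event Grid BlobCreated subject, e.g.
    '/blobServices/default/containers/input/blobs/data/file1.csv'
    -> 'data/file1.csv'."""
    marker = "/blobs/"
    return subject[subject.index(marker) + len(marker):]

def record_latest(subject: str, conn_str: str) -> None:
    """Overwrite the fixed pointer blob with the newest file's name."""
    # Imported here so the parser above is usable without the SDK installed.
    from azure.storage.blob import BlobClient

    name = blob_path_from_subject(subject)
    # The 'meta' container and 'latest.txt' blob are assumed names; the
    # Data Factory Lookup activity points at this one fixed file.
    blob = BlobClient.from_connection_string(conn_str, "meta", "latest.txt")
    blob.upload_blob(name, overwrite=True)
```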

